Foreword



ETAPS 2003 was the sixth instance of the European Joint Conferences on Theory
and Practice of Software. ETAPS is an annual federated conference that was
established in 1998 by combining a number of existing and new conferences. This year
it comprised five conferences (FOSSACS, FASE, ESOP, CC, TACAS), 14 satellite
workshops (AVIS, CMCS, COCV, FAMAS, Feyerabend, FICS, LDTA,
RSKD, SC, TACoS, UniGra, USE, WITS and WOOD), eight invited lectures
(not including those that are specific to the satellite events), and several
tutorials. We received a record number of submissions to the five conferences this
year: over 500, making acceptance rates fall below 30% for every one of them.
Congratulations to all the authors who made it to the final program! I hope that
all the other authors still found a way of participating in this exciting event and
I hope you will continue submitting.

A special event was held to honour the 65th birthday of Prof. Wlad Turski, 
one of the pioneers of our young science. The deaths of some of our “fathers” in 
the summer of 2002 — Dahl, Dijkstra and Nygaard — reminded us that Software 
Science and Technology is, perhaps, no longer that young. Against this sobering 
background, it is a treat to celebrate one of our most prominent scientists and 
his lifetime of achievements. It gives me particular personal pleasure that we are 
able to do this for Wlad during my term as chairman of ETAPS. 

The events that comprise ETAPS address various aspects of the system de- 
velopment process, including specification, design, implementation, analysis and 
improvement. The languages, methodologies and tools which support these ac- 
tivities are all well within its scope. Different blends of theory and practice are 
represented, with an inclination towards theory with a practical motivation on 
the one hand and soundly based practice on the other. Many of the issues invol- 
ved in software design apply to systems in general, including hardware systems, 
and the emphasis on software is not intended to be exclusive. 

ETAPS is a loose confederation in which each event retains its own identity, 
with a separate program committee and independent proceedings. Its format is 
open-ended, allowing it to grow and evolve as time goes by. Contributed talks 
and system demonstrations are in synchronized parallel sessions, with invited 
lectures in plenary sessions. Two of the invited lectures are reserved for “unify- 
ing” talks on topics of interest to the whole range of ETAPS attendees. The 
aim of cramming all this activity into a single one-week meeting is to create a 
strong magnet for academic and industrial researchers working on topics within 
its scope, giving them the opportunity to learn about research in related areas, 
and thereby to foster new and existing links between work in areas that were 
formerly addressed in separate meetings. 

ETAPS 2003 was organized by Warsaw University, Institute of Informatics,
in cooperation with the Foundation for Information Technology Development,
as well as:

— European Association for Theoretical Computer Science (EATCS); 

— European Association for Programming Languages and Systems (EAPLS); 

— European Association of Software Science and Technology (EASST); and 

— ACM SIGACT, SIGSOFT and SIGPLAN. 

The organizing team comprised: 

Mikołaj Bojańczyk, Jacek Chrząszcz, Piotr Chrząstowski-Wachtel, Grzegorz
Grudziński, Kazimierz Grygiel, Piotr Hoffman, Janusz Jabłonowski,
Mirosław Kowaluk, Marcin Kubica (publicity), Sławomir Leszczyński (www),
Wojciech Moczydłowski, Damian Niwiński (satellite events), Aleksy Schubert,
Hanna Sokołowska, Piotr Stańczyk, Krzysztof Szafran, Marcin Szczuka,
Łukasz Sznuk, Andrzej Tarlecki (co-chair), Jerzy Tiuryn, Jerzy Tyszkiewicz
(book exhibition), Paweł Urzyczyn (co-chair), Daria Walukiewicz-Chrząszcz,
Artur Zawłocki.

ETAPS 2003 received support from:¹

— Warsaw University;

— European Commission, High-Level Scientific Conferences and Information
Society Technologies;

— US Navy Office of Naval Research International Field Office;

— European Office of Aerospace Research and Development, US Air Force; and

— Microsoft Research.

Overall planning for ETAPS conferences is the responsibility of its Steering
Committee, whose current membership is:

Egidio Astesiano (Genoa), Pierpaolo Degano (Pisa), Hartmut Ehrig (Berlin),
José Fiadeiro (Leicester), Marie-Claude Gaudel (Paris), Evelyn Duesterwald
(IBM), Hubert Garavel (Grenoble), Andy Gordon (Microsoft Research,
Cambridge), Roberto Gorrieri (Bologna), Susanne Graf (Grenoble), Görel
Hedin (Lund), Nigel Horspool (Victoria), Kurt Jensen (Aarhus), Paul Klint
(Amsterdam), Tiziana Margaria (Dortmund), Ugo Montanari (Pisa), Mogens
Nielsen (Aarhus), Hanne Riis Nielson (Copenhagen), Fernando Orejas
(Barcelona), Mauro Pezzè (Milano), Andreas Podelski (Saarbrücken), Don
Sannella (Edinburgh), David Schmidt (Kansas), Bernhard Steffen (Dortmund),
Andrzej Tarlecki (Warsaw), Igor Walukiewicz (Bordeaux), Herbert
Weber (Berlin).

I would like to express my sincere gratitude to all of these people and
organizations, the program committee chairs and PC members of the ETAPS conferences,
the organizers of the satellite events, the speakers themselves, and
Springer-Verlag for agreeing to publish the ETAPS proceedings. The final votes of thanks
must go, however, to Andrzej Tarlecki and Paweł Urzyczyn. They accepted the
risk of organizing what is the first edition of ETAPS in Eastern Europe, at a
time of economic uncertainty, but with great courage and determination. They
deserve our greatest applause.



Leicester, January 2003 José Luiz Fiadeiro

ETAPS Steering Committee Chair



¹ The contents of this volume do not necessarily reflect the positions or the policies of
these organizations and no official endorsement should be inferred.




Preface 



The International Conference on Compiler Construction (CC) is concerned with 
recent developments in compiler construction, programming language implemen- 
tation, and language design. It addresses work on all phases of compilation and 
for all language paradigms, emphasizing practical and efficient methods and 
tools. The broad area of compiler construction is reflected in these proceedings. 
The papers cover the full range of compiler topics including compiler tools, par- 
sing, type analysis, static analysis, code optimization, register allocation, and 
run-time issues. 

CC 2003 was held in Warsaw, Poland during 5-13 April 2003 and was the 
12th conference in the series. This year, submissions reached a record number of 
83 papers of which 77 were regular papers and 6 were short tool demonstration 
papers. Of these, 20 regular papers and one tool demonstration paper were 
selected for presentation and are included in these proceedings. 

The proceedings also include two invited papers. The CC 2003 invited speaker
was Barbara Ryder, whose talk was entitled Dimensions of Precision in
Reference Analysis of Object-Oriented Programming Languages. In addition, we
have the honor of including the paper by Tony Hoare, who gave one of the two
ETAPS “unifying” invited talks. The title of his talk was The Verifying Compiler:
a Grand Challenge for Computing Research.

The selection of papers took place at an intense program committee meeting 
in Lund, Sweden, on December 6th, 2002. Eight of the PC members attended the 
meeting, and another seven joined in the discussion via a telephone conference 
call. I wish to thank all my colleagues on the program committee for their hard 
work, detailed reviews, and friendly cooperation. I am especially grateful to Nigel 
Horspool, Reinhard Wilhelm, and Evelyn Duesterwald, who, as members of the 
CC steering committee, gave me prompt advice whenever I needed it throughout 
the process of being the program chair. Many thanks also to the large number 
of additional reviewers who helped us read and evaluate the submitted papers. 

Many people helped me in the administration of the PC work. In particular, 
I wish to thank Jonas Wisbrant for being very helpful in the organization of 
the program committee meeting and arranging a simple but very useful web 
facility for the PC members participating via telephone. Thanks also to Christian 
Andersson who helped me assemble these proceedings, and to Tiziana Margaria 
and Martin Karusseit at METAframe for their support of the electronic online 
conference system we used for submissions and reviewing. Finally, I wish to
thank José Luiz Fiadeiro and the ETAPS team for their excellent organization
and coordination of the whole ETAPS event. 



Lund, January 2003 



Görel Hedin
Combined Code Motion and Register Allocation
Using the Value State Dependence Graph 



Neil Johnson and Alan Mycroft

Computer Laboratory, University of Cambridge
William Gates Building, JJ Thompson Avenue,
Cambridge, CB3 0FD, UK
{Neil.Johnson, Alan.Mycroft}@cl.cam.ac.uk



Abstract. We define the Value State Dependence Graph (VSDG). The 
VSDG is a form of the Value Dependence Graph (VDG) extended by the 
addition of state dependence edges to model sequentialised computation. 
These express store dependencies and loop termination dependencies of 
the original program. We also exploit them to express the additional 
serialization inherent in producing final object code. 

The central idea is that this latter serialization can be done incrementally
so that we have a class of algorithms which effectively interleave register
allocation and code motion, thereby avoiding a well-known phase-order
problem in compilers. This class operates by first normalizing the VSDG
during construction, to remove all duplicated computation, and then
repeatedly choosing between: (i) allocating a value to a register, (ii)
spilling a value to memory, (iii) moving a loop-invariant computation
within a loop to avoid register spillage, and (iv) statically duplicating a
computation to avoid register spillage.

We show that the classical two-phase approaches (code motion then register
allocation, in both Chow and Chaitin forms) are examples of this class,
and propose a new algorithm based on depth-first cuts of the VSDG.



1 Introduction 

An important problem encountered by compiler designers is the phase ordering
problem, which can be phrased as “in which order does one schedule the
register allocation and code motion phases to give the best target code?”. These
phases are antagonistic to each other: code motion may increase register
pressure, while register allocation places additional dependencies between
instructions, artificially constraining code motion. In this paper we show that a unified
approach, in which both register allocation and code motion are considered
together, sidesteps the problem of which phase to do first.

In support of this endeavour, we present a new program representation, the
Value State Dependence Graph (VSDG), as an extension of the Value Dependence
Graph (VDG) [2T]. It is a simple unifying framework within which a wide
range of code space optimizations can be implemented. We believe that the
VSDG can be used both in intermediate code transformations and all the way
through to final target code generation.

G. Hedin (Ed.): CC 2003, LNCS 2622, 2003.

© Springer-Verlag Berlin Heidelberg 2003
Traditional register allocation has been represented as a graph colouring
problem, originally proposed by Chaitin [3], and based on the Control Flow
Graph (CFG). Unfortunately the CFG imposes an artificial ordering of
instructions, constraining the register allocator to the given order.

The VDG represents programs as value dependencies: there is an edge (p, n),
drawn as an arrow n → p, if node n requires the value of p to compute its own
value. This representation removes any specific ordering of instructions (nodes),
but does not elegantly handle loop and function termination dependencies.

The VSDG introduces state dependency edges to model sequentialised
computing. These edges also have the surprising benefit of generalising the VSDG:
by adding sufficient serializing edges we can select any one of a number of CFGs.
Our thesis is that relaxing the exact serialization of the CFG into the more
general VSDG supports a combined register allocation and code motion algorithm.



1.1 Paper Structure 

This paper is structured as follows. Section 2 describes the forms of nodes and
edges in the VSDG, while Section 3 explores additional serialization and liveness
within the VSDG. In Section 4 we describe the general approach to joint register
allocation and code motion (RACM) as applied to the VSDG, and show that
classical Chaitin/Chow-style register colouring specialises it. Section 5 introduces
our new greedy register allocation algorithm. Section 6 provides context for this
paper with a review of related work, with Section 7 concluding.

2 Formalism 

The Value State Dependence Graph is a directed graph consisting of operation
nodes, loop and merge nodes together with value- and state-dependency edges.
Cycles are permitted but must satisfy various restrictions. A VSDG represents
a single procedure; this matches the classical CFG but differs from the VDG
in which loops were converted to tail-recursive procedures called at the logical
start of the loop. We justify this because of our interest in performing RACM
at the same time; inter-procedural motion and allocation issues are considered
a topic for future work.

An example VSDG is shown in Fig. 1. In (a) we have the original C source
for a recursive factorial function. The corresponding VSDG (b) shows both value
and state edges and a selection of nodes.



2.1 Definition of the VSDG 

Definition 1. A VSDG is a labelled directed graph $G = (N, E_V, E_S, \ell, N_0, N_\infty)$
consisting of nodes $N$ (with unique entry node $N_0$ and exit node $N_\infty$),
value-dependency edges $E_V \subseteq N \times N$, and state-dependency edges
$E_S \subseteq N \times N$. The labelling function $\ell$ associates each node with an
operator (see §2.2 for details).
int fac( int n ) {
    int result;

    if ( n == 1 )
        result = n;
    else
        result = n * fac( n - 1 );
    return result;
}



(a) (b) 

Fig. 1. A recursive factorial function, whose VSDG illustrates the key graph
components: value dependency edges (solid lines), state dependency edges (dashed
lines), a const node, a call node, two γ-nodes, a conditional node (==), and the
function entry and exit nodes. The left-hand γ-node returns the original function argument
if the condition is true, or that of the expression otherwise. The right-hand γ-node
behaves similarly for the state edges, returning either the state on entry to the function,
or that returned by the call node.




VSDGs have to satisfy two well-formedness conditions. Firstly, $\ell$ and the
($E_V$) arity must be consistent, e.g. a binary arithmetic operator must have
two inputs; secondly (at least for the purposes of this paper) the VSDG must
correspond to a structured program, e.g. there are no cycles in the VSDG
except those mediated by θ (loop) nodes (see §3.2).
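The graph of Definition 1 and the first well-formedness condition can be sketched in C as follows. This is only an illustration: the operator set, the `struct vsdg` layout, and the arity table in `expected_value_inputs` are assumptions made for the example, not the representation used by the authors.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical operator labels standing in for the labelling function. */
enum op { OP_ENTRY, OP_EXIT, OP_CONST, OP_ADD, OP_SUB, OP_EQ, OP_GAMMA };

enum edge_kind { EDGE_VALUE, EDGE_STATE };  /* member of E_V or of E_S */

struct edge {
    int from, to;          /* data or control flows from `from` to `to` */
    enum edge_kind kind;
};

struct vsdg {
    const enum op *label;  /* label[n] is the operator of node n */
    size_t n_nodes;
    const struct edge *edges;
    size_t n_edges;
};

/* Value inputs each operator expects in this sketch; -1 means "not checked". */
static int expected_value_inputs(enum op o)
{
    switch (o) {
    case OP_ENTRY:
    case OP_CONST:
        return 0;
    case OP_ADD:
    case OP_SUB:
    case OP_EQ:
        return 2;
    case OP_GAMMA:
        return 3;          /* C, T and F */
    default:
        return -1;
    }
}

/* Count the E_V edges arriving at node n. */
static int value_inputs(const struct vsdg *g, size_t n)
{
    int count = 0;
    for (size_t k = 0; k < g->n_edges; k++)
        if (g->edges[k].kind == EDGE_VALUE && (size_t)g->edges[k].to == n)
            count++;
    return count;
}

/* First well-formedness condition: labels and E_V arity agree. */
bool arity_consistent(const struct vsdg *g)
{
    assert(g != NULL);
    for (size_t n = 0; n < g->n_nodes; n++) {
        int want = expected_value_inputs(g->label[n]);
        if (want >= 0 && value_inputs(g, n) != want)
            return false;
    }
    return true;
}
```

Keeping both edge sets in one array with a `kind` tag makes it easy to add serializing (state) edges later without touching the value edges.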

Value dependency ($E_V$) indicates the flow of values between nodes, and must
be preserved during register allocation and code motion.

State dependency ($E_S$), for this paper, represents two things. The first is the
essential sequential dependency required by the original program, e.g. a given
load instruction may be required to follow a given store instruction without
being re-ordered, and a return node in general must wait for an earlier loop to
terminate even though there might be no value-dependency between the loop
and the return node. The second purpose, which in a sense is the centre of
this work, is that state-dependency edges can be added incrementally until the
VSDG corresponds to a unique CFG (§3.1). Such state dependency edges are
called serializing edges.

An edge $(n_1, n_2)$ represents the flow of data or control from $n_1$ to $n_2$, i.e.
in the forwards data flow direction, so we will see $n_1$ as a predecessor of $n_2$.
Similarly we will regard $n_2$ as a successor of $n_1$. If we wish to be specific we
will write V-successors or S-successors for $E_V$ and $E_S$ successors respectively.
Similarly, we will write $\mathrm{succ}_V(n)$, $\mathrm{pred}_S(n)$ and the like for appropriate sets
of successors or predecessors, and $\mathrm{dom}(n)$ and $\mathrm{pdom}(n)$ for sets of dominators
and post-dominators respectively. We will draw pictures in the VDG form, with
arrows following the backwards data flow direction, so that the edge $(n_1, n_2)$ will
be represented as an arrow from $n_2$ to $n_1$.

The VSDG inherits from the VDG the property that a program is implicitly
represented in Static Single Assignment (SSA) form [8]: a given operator node,
n, will have zero or more $E_V$-successors using its value. Note that, in
implementation terms, a single register can hold the produced value for consumption at
all successors; it is therefore useful to talk about the idea of an output port for
n being allocated a specific register, r, to abbreviate the idea of r being used
for each edge $(n_1, n_2)$ where $n_2 \in \mathrm{succ}(n_1)$. Similarly, we will talk about (say)
the “right-hand input port” of a subtraction instruction, or of the R-input of a
θ-node.



2.2 Node Labelling with Instructions 

There are four main classes of VSDG nodes, based on those of the triVM
Intermediate Language: value nodes (representing pure arithmetic), γ-nodes
(conditionals), θ-nodes (loops), and state nodes (side-effects).



2.2.1 Value Nodes. The majority of nodes in a VSDG generate a value based 
on some computation (add, subtract, etc) applied to their dependent values 
(constant nodes, which have no dependent nodes, are a special case). 

2.2.2 γ-Nodes. Our γ-node is similar to the γ-node of Gated Single
Assignment form [2] in being dependent on a control predicate, rather than the
control-independent nature of SSA φ-functions.

Definition 2. A γ-node γ(C, T, F) evaluates the condition dependency C, and
returns the value of T if C is true, otherwise F.

We generally treat γ-nodes as single-valued nodes (contrast θ-nodes, which are
treated as tuples), with the effect that two separate γ-nodes with the same
condition can be later combined (Section 4) into a tuple using a single test.
Fig. 2 illustrates two γ-nodes that can be combined in this way.



2.2.3 θ-Nodes. The θ-node models the iterative behaviour of loops, modelling
loop state with the notion of an internal value which may be updated on each
iteration of the loop. It has five specific ports which represent dependencies at
various stages of computation.

Definition 3. A θ-node θ(C, I, R, L, X) sets its internal value to initial value I
then, while condition value C holds true, sets L to the current internal value and
updates the internal value with the repeat value R. When C evaluates to false
computation ceases and the last internal value is returned through the X port.



A loop which updates k variables will have: a single condition port C, initial-value
ports $I_1, \ldots, I_k$, loop iteration ports $L_1, \ldots, L_k$, loop return ports $R_1, \ldots, R_k$,
and loop exit ports $X_1, \ldots, X_k$. The example in Fig. 3 shows a pair (2-tuple) of
values being used, one for each loop-variant value.

For some purposes the L and X ports could be fused, as both represent
outputs within, or exiting, a loop (the values are identical, while the C input merely
selects their routing). We avoid this for two reasons: (i) we have operational
semantics for VSDGs and these semantics require separation of these concerns;
and (ii) our construction of $G^{noloop}$ (§3.2) requires it.

The θ-node directly implements pre-test loops (while, for); post-test loops
(do...while, repeat...until) are synthesised from a pre-test loop preceded
by a duplicate of the loop body. At first this may seem to cause unnecessary
duplication of code, but it has two important benefits: (i) it exposes the first
loop body iteration to optimization in post-test loops (cf. loop-peeling), and (ii)
it normalizes all loops to one loop structure, which both reduces the cost of
optimization, and increases the likelihood of two schematically-dissimilar loops
being isomorphic in the VSDG.
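The behaviour described by Definitions 2 and 3 can be mimicked with a small C sketch. The names `gamma_node`, `theta_node`, and `struct tuple` are hypothetical, and the sketch assumes that the condition C is computed from the values on the L port, as in the for-loop example of Fig. 3; it illustrates the semantics only and is not a compiler data structure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* The 2-tuple of loop-variant values (i, j) of the for-loop example. */
struct tuple { int i, j; };

typedef bool (*cond_fn)(struct tuple l);            /* C port, reads L   */
typedef struct tuple (*repeat_fn)(struct tuple l);  /* R port, reads L   */

/* gamma(C, T, F): the value of T if C is true, otherwise F. */
int gamma_node(bool c, int t, int f)
{
    return c ? t : f;
}

/* theta(C, I, R, L, X): returns the value delivered on the X port. */
struct tuple theta_node(cond_fn c, struct tuple initial, repeat_fn r)
{
    assert(c != NULL && r != NULL);
    struct tuple l = initial;   /* internal value, visible on the L port */
    while (c(l))                /* while C holds ...                     */
        l = r(l);               /* ... update with the repeat value R    */
    return l;                   /* last internal value, via the X port   */
}

static bool loop_cond(struct tuple l) { return l.i < 10; }

static struct tuple loop_body(struct tuple l)
{
    struct tuple next = { l.i + 1, l.j - 1 };   /* ++i and --j */
    return next;
}

/* for (i = 0; i < 10; ++i) --j;  only j is used after the loop. */
int run_for_loop(int j0)
{
    struct tuple init = { 0, j0 };
    return theta_node(loop_cond, init, loop_body).j;
}
```

Calling `run_for_loop(25)` walks the loop ten times, so only the `j` component of the exit tuple is consumed, matching the remark that nothing depends on the `i` port of X.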



2.2.4 State Nodes. Loads, stores, and their volatile equivalents compute
a value and/or state (non-volatile loads return a value from memory without
generating a new state). Accesses to volatile memory or hardware can change
state independently of compiler-aware reads or writes (cf. IO-state [2j).

The call node takes both the name of the function to call and a list of 
arguments, and returns a list of results; it is treated as a state node as the 
function body may read or update state. 

We maintain the simplicity of the VSDG by imposing the restriction that
all functions have one return node (the exit node $N_\infty$), which returns at least
one result (which will be a state value in the case of void functions). To ensure
that function calls and definitions are colourable, we suppose that the number of
arguments to, and results from, a function is smaller than the number of physical
registers; further arguments can be passed via a stack as usual.



a) if (P)
       x = 2, y = 3;
   else
       x = 4, y = 5;

b) if (P) x = 2;
   else x = 4;

   if (P) y = 3;
   else y = 5;

Fig. 2. Two different code schemes (a) and (b) map to the same γ-node structure.
j = ...;
for( i = 0;
     i < 10;
     ++i )
    --j;
... = j;

Fig. 3. A θ-node example showing a for loop. Evaluating the θ-node’s X port triggers
it to evaluate the I value (outputting the value on the L port). While C evaluates to
true, it evaluates the R value (which in this case also uses the θ-node’s L value). When
C is false, it returns the final internal value through the X port. As i is not used after
the θ-node loop, there is no dependency on the i port of X.



Note also that the VSDG neither forces loop-invariant code into nor out of
loop bodies; rather, it allows later phases to decide the placement of
loop-invariant nodes by adding serializing edges.

3 Applying the VSDG to RACM 

3.1 Serialization 

Weise et al. observe that their mapping from CFGs to VDGs is many-one;
that paper also suggests that “Code motion optimizations are decided when
the demand dependence graph is constructed from the VDG”, i.e. that a VDG
should be mapped back into a CFG for further processing, but it does not give
an algorithm or consider which of the many CFGs corresponding to a VSDG
should be selected.

We identify VSDGs with ‘enough’ serializing edges with CFGs — such VSDGs 
can be simply transformed into CFGs if desired — the task of RACM then be- 
ing to make the VSDG sufficiently sequential. The following informal definition 
captures this idea for the purposes of this paper. 

Definition 4. A sequential VSDG is one which has enough serializing edges to
make it correspond to a single CFG.

Here ‘enough’ means in essence that each node in the VSDG has a unique
($E_V \cup E_S$) immediate dominator which can be seen as its predecessor in the CFG.
Exceptions arise for the start node (which has no predecessors in the VSDG or
corresponding CFG), γ-nodes and θ-nodes. Given a γ-node, we interpret those
nodes which the T port post-dominates as the condition-true sub-CFG and those
which the F port post-dominates as the condition-false sub-CFG; a control-split
node (corresponding to a CFG test node) is added to the VSDG as the
immediate $E_S$-dominator of both sub-CFGs. For a θ-node, we recursively require
this sequential property for its body, L, and interpret the “unique immediate
dominator” property as a constraint on its I port.
3.2 VSDG Well-Formedness

As in the VDG we restrict attention to reducible graphs for the VSDG; recall that
any CFG can be made reducible by duplicating code or by introducing additional
boolean variables. For this paper we further restrict to programs whose only loops
are while-loops and which exit them only by the condition becoming false, i.e.
no break and the like (again this could be achieved with code duplication or
with additional boolean variables).

In order to specify the “all cycles in a VSDG are mediated by θ-nodes”
restriction, it is convenient to define a transformation on VSDGs.

Definition 5. Given a VSDG, G, we define $G^{noloop}$ to be identical to G except
that each θ-node $\theta_i$ is replaced with two nodes, $\theta_i^{head}$ and $\theta_i^{tail}$; edges to or from
ports I or L of the original θ-node are redirected to $\theta_i^{head}$, whereas those to or
from ports R, X, C are redirected to $\theta_i^{tail}$.

We then require $G^{noloop}$ to be an acyclic graph.

When adding serializing edges we must maintain this acyclic property. To
serialize nodes connected to a θ-node’s X port we add serializing edges to $\theta^{tail}$;
all nodes within the body of the loop are on the sequential path from $\theta^{tail}$ to
$\theta^{head}$; all other nodes are serialized before $\theta^{head}$. Definition 6 below sets out the
conditions for a node to be within a loop.

Although this is merely a formal transformation, note that if we interpret
$\theta^{tail}$ as a γ-node (or possibly a tuple thereof) and interpret $\theta^{head}$ as an identity
operation then $G^{noloop}$ represents a VSDG in which each loop is executed zero
or one times according to the condition. Our $\theta^{head}$ and $\theta^{tail}$ nodes, while similar
to GSA’s μ- and η-functions [2], avoid the need for their “non-deterministic
merge gate” to break cyclic dependencies.

The formal definition of a VSDG being well-formed is then:

Definition 6. A VSDG, G, is well-formed if (i) $G^{noloop}$ is acyclic and (ii)
for each pair of ($\theta^{head}$, $\theta^{tail}$) nodes in $G^{noloop}$, $\theta^{tail}$ post-dominates all nodes in
$\mathrm{succ}^{+}(\theta^{head})$.

The second condition says that no value computed during a loop can be used
outside the loop, except via the X port of a θ-node.

3.3 VSDG Normalization 

The RACM algorithms below will assume (for maximal optimization potential
rather than correctness) that the VSDG has been normalized, roughly in the
way of ‘hash-CONSing’: any two nodes which have identical input nodes will
be assumed to have been replaced with a single node provided that this does not
violate well-formedness by creating a cycle in the VSDG. Consider

int f( int v[], int i ) {
    int a = v[i+1];
    v[7] = 0;
    return v[i+1] + a;
}

There will only be one node for the constant 1, and one for the addition of this
node to the second formal parameter (i+1) but two nodes for the loading from
v[i+1] because sharing this node would lead to a cycle in $E_S$ by being both a
predecessor and successor of the store to v[7].

Note that this is a safe form of CSE and loop invariant code lifting; this 
optimization is selectively undone (node cloning) during the joint RACM phase 
when required by register pressure. 
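The normalization just described can be approximated by a hash-consing style constructor. The sketch below uses a linear search instead of a hash table, and it models the state dependency of a load or store as one more input node; both choices, together with the names `intern`, `struct node`, and the operator set, are assumptions of this example. The paper instead phrases the restriction as not creating a cycle when nodes are merged.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { MAX_NODES = 64, MAX_IN = 4 };

enum op { OP_PARAM, OP_ENTRY_STATE, OP_CONST, OP_ADD, OP_LOAD, OP_STORE };

struct node {
    enum op op;
    int imm;            /* constant value or parameter index, else 0 */
    int in[MAX_IN];     /* ids of input nodes, state input included  */
    int n_in;
};

struct graph {
    struct node nodes[MAX_NODES];
    int count;
};

/* True if node n has exactly this operator, immediate and inputs. */
static bool same_node(const struct node *n, enum op op, int imm,
                      const int *in, int n_in)
{
    if (n->op != op || n->imm != imm || n->n_in != n_in)
        return false;
    for (int k = 0; k < n_in; k++)
        if (n->in[k] != in[k])
            return false;
    return true;
}

/* Return the id of an existing identical node, or append a new one. */
int intern(struct graph *g, enum op op, int imm, const int *in, int n_in)
{
    assert(n_in >= 0 && n_in <= MAX_IN);
    for (int id = 0; id < g->count; id++)
        if (same_node(&g->nodes[id], op, imm, in, n_in))
            return id;
    assert(g->count < MAX_NODES);
    struct node *n = &g->nodes[g->count];
    n->op = op;
    n->imm = imm;
    n->n_in = n_in;
    for (int k = 0; k < n_in; k++)
        n->in[k] = in[k];
    return g->count++;
}
```

Building the function `f` above with `intern` yields a single node for the constant 1 and a single addition, while the two loads of `v[i+1]` stay distinct because one takes the entry state and the other the state produced by the store.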



3.4 Liveness in VSDGs 

For the purposes of register allocation (cf. the register interference graph), we
need to know which (output ports of) VSDG nodes may hold values
simultaneously so we can forbid them being allocated the same register.

We define a cut to be a partition $N_1 \cup N_2$ of nodes in the VSDG with the
property that there is no $E_V \cup E_S$ edge from $N_2$ to $N_1$ (excepting edges from L
ports of θ-nodes; see the $G^{noloop}$ construction).

We now define nodes n and n' to interfere if there is a cut $N_1 \cup N_2$ with
$n, n' \in N_1$ and with both $\mathrm{succ}(n)$ and $\mathrm{succ}(n')$ having non-empty intersections
with $N_2$.

This generalises the normal concept of register interference in a CFG; there
a cut is just a program point and interference means “simultaneously live at any
program point”. Similarly “virtual register” corresponds to our “output port of
a node”. Note that we use the concept of “cut based on Depth From Root” in
Section 5 for our new greedy algorithm.
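The cut and interference definitions translate almost directly into code. The sketch below checks them for one given partition only, whereas the definition asks whether some cut exists; it also ignores the exception for L ports of θ-nodes. The `side` array encoding and the function names are assumptions of the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct edge { int from, to; };  /* an E_V or E_S edge, flow from -> to */

/* side[n] is 1 if node n lies in N1 and 2 if it lies in N2. */
bool is_cut(const struct edge *e, size_t n_edges, const int *side)
{
    assert(e != NULL && side != NULL);
    for (size_t k = 0; k < n_edges; k++)
        if (side[e[k].from] == 2 && side[e[k].to] == 1)
            return false;       /* an edge from N2 back into N1 */
    return true;
}

/* True if succ(n) has a non-empty intersection with N2. */
static bool feeds_n2(const struct edge *e, size_t n_edges,
                     const int *side, int n)
{
    for (size_t k = 0; k < n_edges; k++)
        if (e[k].from == n && side[e[k].to] == 2)
            return true;
    return false;
}

/* Do nodes a and b interfere with respect to this particular cut? */
bool interfere_at(const struct edge *e, size_t n_edges,
                  const int *side, int a, int b)
{
    return is_cut(e, n_edges, side)
        && side[a] == 1 && side[b] == 1
        && feeds_n2(e, n_edges, side, a)
        && feeds_n2(e, n_edges, side, b);
}
```

A full interference test would enumerate candidate cuts (for instance those induced by depth, as Section 5 suggests) and report interference as soon as `interfere_at` holds for one of them.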

4 Register Allocation and Code Motion 

The goal of register allocation in the VSDG is to allocate one physical register
(from a fairly small set) to each node’s output ports. θ-nodes are a special case,
as they require multiple registers on their tupled I, R, L and X ports.

Register requirements can be reduced by serializing computations (a register 
can be reused in two independent computations if we know that they do not 
interleave), or by reducing the range over which a value is live by duplicating 
a computation or by spilling a value to memory. In both cases the idea is that 
these operations reduce the register interference. 



4.1 A Non-deterministic Approach 

Given a VSDG we repeatedly apply the following non-deterministic algorithm 
until all the nodes are coloured and the VSDG is sequential: 

1. Colour a port with a physical register, provided no port it interferes with
is already coloured with the same register;

2. Add a serializing edge to force one node before another; this removes edges
from the interference graph by forbidding interleaving of computations;

3. Clone a node, i.e. recalculate it to reduce register pressure;

4. Tunnel values through memory by introducing store/load spill nodes;

5. Merge two γ-nodes a and b into a tuple, provided their C ports reference the
same node and there is no path from a to b or from b to a.

The first action assigns a physical register to a port of the given node. The second
moves the node, with the effect of changing the register usage; the choice of which
node to move is determined by specific algorithms (see §4.2 and Section 5).
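The colouring condition of action 1, and the effect that action 2 has on it, can be illustrated as follows. The interference relation is stored here as an explicit matrix, and `serialize` simply deletes one interference edge; both are simplifying assumptions of this sketch (in the VSDG a serializing edge changes the graph, and the interference relation follows from it). All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { MAX_PORTS = 16, UNCOLOURED = -1 };

struct alloc {
    int n_ports;
    bool interferes[MAX_PORTS][MAX_PORTS];  /* symmetric relation */
    int reg[MAX_PORTS];                     /* register or UNCOLOURED */
};

void alloc_init(struct alloc *a, int n_ports)
{
    assert(a != NULL && n_ports >= 0 && n_ports <= MAX_PORTS);
    a->n_ports = n_ports;
    for (int p = 0; p < MAX_PORTS; p++) {
        a->reg[p] = UNCOLOURED;
        for (int q = 0; q < MAX_PORTS; q++)
            a->interferes[p][q] = false;
    }
}

/* Action 1 precondition: no interfering port already holds register r. */
bool can_colour(const struct alloc *a, int port, int r)
{
    for (int q = 0; q < a->n_ports; q++)
        if (q != port && a->interferes[port][q] && a->reg[q] == r)
            return false;
    return true;
}

/* Action 1: colour the port if allowed; report whether it succeeded. */
bool colour(struct alloc *a, int port, int r)
{
    if (!can_colour(a, port, r))
        return false;
    a->reg[port] = r;
    return true;
}

/* Modelled effect of action 2: p and q can no longer be live together. */
void serialize(struct alloc *a, int p, int q)
{
    a->interferes[p][q] = false;
    a->interferes[q][p] = false;
}
```

With two interfering ports, the second cannot take the register of the first until the pair has been serialized, which is the trade-off the non-deterministic algorithm explores.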

Node cloning replaces a single instance of a node that has multiple uses, 
with multiple copies (clones) of the node, each with a subset of the original 
dependency edges. For example, a node n with two dependent nodes p and q
can be cloned into n' and n'', with p dependent on n' and q dependent on n''.

Spilling follows the traditional Chaitin-style register spilling where we add 
store and load nodes, together with some temporary storage in memory. 

Finally, because the initial VSDG was normalized to ensure that each γ-node
represented the merge of a single variable, given a VSDG such as that in Fig. 2
we can either arrange to serialize the two γ-nodes (action 2), resulting in two
separate tests (or conditional move instructions), or to merge them (action 5) so
that a single test is used (as in Fig. 2(a)).

The cost of spilling loop-variant variables is rather higher than the store-and-reload
for a normal spill. For θ-nodes where the tuple is wider than the available
target registers, we must spill one or more of the θ-node variables over the loop
test code, not merely within the loop itself. At most this requires two stores and
three loads for each variable spilled. Fig. 4 shows the location of the five spill
nodes (a), with table (b) describing the use of each of the spill nodes.

4.2 The Classical Algorithms 

We can phrase the classical Chaitin/Chow-style register allocators as instances
of the above algorithm:

1. Perform all code motion transforms through adding serializing edges and
merging γ-nodes if not already sequentialised;




Spill Node   Needed if variable...
A            Is initialised
B            Is used in Loop Body
C            Is defined in Loop Body
D            Is used after the loop
E            Is used in the condition predicate P

(b)

Fig. 4. Illustrating the locations of the five spill nodes associated with a θ-node.






N. Johnson and A. Mycroft 



2. Map the VSDG onto a CFG by adding additional serializing edges; 

3. If there are insufficient physical registers to colour a node port, then: 

a) Chaitin-style allocation [5]: spill nodes, with the restriction that the target
register of the reload is the same as the source register of the store.
Chaitin's cost estimates can be applied to determine which edge to spill;

b) Chow-style allocation [S]: spill nodes, but without the register restriction 
of Chaitin-style, thus splitting the live-range of the virtual register; use 
Chow’s heuristics to decide which edge to split. 

In both the Chaitin and Chow instances, post-code-motion transformations during
register allocation are limited to inserting store and load nodes into the
program.

5 A New Register Allocation Algorithm 

The Chaitin/Chow algorithms do not make full use of the dependence informa- 
tion within the VSDG; they assume that a previous phase has performed code 
motion to produce a sequential VSDG — corresponding to a single CFG — on 
which traditional register colouring algorithms are applied. 

We now present the central point of this paper: a register allocation
algorithm specifically designed to maximise the usage of information within the
VSDG. The algorithm consists of two distinct phases:

1. Starting at the exit node $N_\infty$, walk up the graph edges calculating the
maximal Depth From Root (DFR) of each node (see Definition 7); for each set of
nodes of equal depth calculate their liveness width (the number of distinct
values on which they depend, taking into account control flow).

2. Apply a forward “snow-plough”^-like graph reshaping algorithm, starting
from $N_\infty$ and pushing up towards $N_0$, to ensure that the liveness width is
never more than the number of physical registers. This is achieved by splitting,
spilling or adding serializing edges in a greedy way so that the previously
smoothed-out parts of the graph (nearer the exit) are not re-visited.

The result is a colourable VSDG; colouring it constitutes register assignment,
completing the algorithm.

5.1 Partitioning the VSDG 

The first phase annotates the VSDG with the maximal Depth From Root. The 
second phase then processes each cut of the VSDG in turn. 

Definition 7. The maximal Depth From Root, $D(n)$, of a node $n \in N$ is the
length of the longest path $p \in (E_V \cup E_S)^*$ from the root to n. Loop bodies are
traversed once, such that a θ-node has two DFRs, one each for the $\theta^{head}$ and
$\theta^{tail}$.



^ Imagine a snow plough pushing forward, scooping up excess snow, and depositing it 
where there is little snow. The goal is to even out the peaks and troughs. 

Definition 8. A depth-first cut $S_{=d}$ is the set of nodes with the same DFR d:

$S_{=d} = \{ n \in N \mid D(n) = d \}$

It is convenient also to write

$S_{\le d} = \{ n \in N \mid D(n) \le d \}$

$S_{>d} = \{ n \in N \mid D(n) > d \}$

Note that the partition $(S_{\le d}, S_{>d})$ is a cut according to the definition of §3.4.
Computing the DFR of a given VSDG is equivalent to computing the depth-first
search of the graph: we simply start at the root node $N_\infty$ and recursively
walk along all dependency edges, setting each node to the larger of the node's
current DFR and the new DFR, terminating either at the entry node $N_0$ or at nodes
with DFRs greater than the current DFR. It has a complexity of $O(N + E_V + E_S)$.
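
The recursive walk described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes an acyclic graph given as hypothetical `(producer, consumer)` pairs, ignores the special treatment of loop bodies, and may revisit nodes, so it does not claim the complexity bound stated in the text.

```python
from collections import defaultdict


def max_depth_from_root(edges, root):
    """Return {node: maximal Depth From Root} for a DAG.

    edges: iterable of (producer, consumer) pairs, value and state edges alike
           (this pair orientation is an assumption of the sketch).
    root:  the exit node, which has depth 0.
    """
    producers = defaultdict(list)
    for producer, consumer in edges:
        producers[consumer].append(producer)

    depth = {root: 0}

    def walk(node):
        new_depth = depth[node] + 1
        for p in producers[node]:
            # Keep the larger of the current and the new DFR; do not descend
            # again through a node whose DFR is already at least as large.
            if depth.get(p, -1) < new_depth:
                depth[p] = new_depth
                walk(p)

    walk(root)
    return depth


def cuts(depth):
    """Group nodes into cuts S_{=d}, keyed by depth d."""
    by_depth = defaultdict(set)
    for node, d in depth.items():
        by_depth[d].add(node)
    return dict(by_depth)


# Hypothetical four-node graph: entry feeds a and b, a feeds b, both feed exit.
example_edges = [("entry", "a"), ("entry", "b"), ("a", "b"),
                 ("a", "exit"), ("b", "exit")]
example_depth = max_depth_from_root(example_edges, "exit")
example_cuts = cuts(example_depth)
```

In this example node a is reachable from the exit both directly and through b, and it ends up at the larger depth, which is the "maximal" part of the definition.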



5.2 Calculating Liveness Width 

We wish to transform each cut so that the number of nodes having edges passing
through it is no greater than $R$, the number of registers available for colouring.
For a cut of depth d the set of such live nodes is given by

$W_{in}(d) = S_{>d} \cap \mathrm{pred}_V(S_{\le d})$

i.e. those nodes which are further than d from the exit node but whose values
may be used on the path to the exit node. Note that only $E_V$ and not $E_S$ edges
count towards liveness.
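
As a small illustration of this set, the sketch below computes the uncorrected liveness width of each cut. The names `live_nodes`, `depth` and `value_edges` are assumptions of the example, value edges are taken to be `(producer, consumer)` pairs, and the correction for conditional nodes discussed next is not applied.

```python
def live_nodes(depth, value_edges, d):
    """W_in(d): nodes deeper than d whose value is consumed at depth <= d.

    depth:       {node: DFR}, assumed to be precomputed
    value_edges: iterable of (producer, consumer) pairs from E_V only
    """
    return {
        producer
        for producer, consumer in value_edges
        if depth[producer] > d and depth[consumer] <= d
    }


# Hypothetical DFR annotation and value edges for a four-node graph.
depth = {"exit": 0, "b": 1, "a": 2, "entry": 3}
value_edges = [("entry", "a"), ("entry", "b"), ("a", "b"),
               ("a", "exit"), ("b", "exit")]
widths = {d: len(live_nodes(depth, value_edges, d)) for d in range(3)}
```

A cut whose width exceeds the number of available registers is one that the second phase would have to reshape.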

One might expect that $|W_{in}(d)|$ is the number of registers required to compute
the nodes in $S_{\le d}$, but this overstates the number of registers required for
conditional nodes. γ-nodes have the property that the edges of each of their
selection dependencies are disjoint: on any given execution trace, exactly one
path to the γ-node will be executed, and so we can reuse the
same registers to colour its True- and False-dominated nodes.

We identify the γ-node dependency register sets using the dominance property
thus:

Definition 9. A node $n \in N$ is a predicated node iff it is post-dominated by
either the True or the False port of a γ-node, but not by both.

Note that replacing the nodes in either the True or the False region with no-ops
gives, in each case, a lower bound on the liveness width of the cut^. Moreover, the greater of
the liveness widths for these modified VSDGs gives the corrected liveness width
for the original VSDG.

We prefer to formulate this in constraint form. 

^ Such no-ops are nodes with no value dependencies on input or output, but with 
state-dependences where previously there was either a value or state edge so that 
the DFR is not affected. 

Definition 10. A VSDG is colourable with $R$ registers if either:

1. Every cut of depth d has $|W_{in}(d)| \le R$; or

2. Each VSDG resulting from replacing either True or False regions with no-ops
satisfies 1.

5.3 Pass through Edges 

Some edges (i.e. those of lifetime greater than one) pass through a cut. These
pass-through (PT) edges may also interact with the cut. However, even the ordinary
PT edges require a register, and so must be accommodated in any colouring
scheme.

Definition 11. The lifetime $L$ of an edge $(n, n')$ is the number of cuts over
which it spans:

$L(n, n') = D(n) - D(n')$

Definition 12. An edge $(n, n') \in E_V$ is a Pass Through (PT) edge over cut S
of depth d when:

$D(n) > d > D(n')$

A Used Pass Through (UPT) edge is a PT edge from a node which is also used
by one or more nodes in S, i.e. there is $n'' \in S$ with $(n, n'') \in E_V$.

In particular, PT (and to a lesser extent UPT) edges are ideal candidates for 
spilling when transforming a cut. The next section discusses this further. 
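
The two definitions can be illustrated with a short sketch that classifies the value edges crossing a given cut. The function names and the example graph are hypothetical, and edges are again assumed to be `(producer, consumer)` pairs annotated with a precomputed depth.

```python
def pass_through_edges(depth, value_edges, d):
    """Return (pt, upt): PT edges over the cut of depth d, and the used subset."""
    cut = {n for n, dn in depth.items() if dn == d}
    pt = [(n, m) for n, m in value_edges if depth[n] > d > depth[m]]
    # Producers that are also consumed by some node inside the cut.
    used_in_cut = {n for n, m in value_edges if m in cut}
    upt = [(n, m) for n, m in pt if n in used_in_cut]
    return pt, upt


def lifetime(depth, edge):
    """Number of cuts spanned by an edge (producer, consumer)."""
    n, m = edge
    return depth[n] - depth[m]


# Hypothetical chain z -> y -> x -> exit, plus a long edge z -> exit.
depth = {"exit": 0, "x": 1, "y": 2, "z": 3}
value_edges = [("z", "y"), ("y", "x"), ("x", "exit"), ("z", "exit")]
pt1, upt1 = pass_through_edges(depth, value_edges, 1)
pt2, upt2 = pass_through_edges(depth, value_edges, 2)
```

Here the edge from z to exit passes through both inner cuts; it is a plain PT edge over the cut at depth 1 and a UPT edge over the cut at depth 2, because y in that cut also uses z.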

5.4 Register Allocation 

In order to colour the graph successfully with $R$ target machine registers, no cut
of the graph may be wider (i.e. in the number of live registers) than the number
of target registers available.

For every cut of depth d calculate $W_{in}(d)$. Then, while $|W_{in}(d)| > R$, we
apply three transformations to the VSDG in increasing order of cost: (i) node
raising (code motion), (ii) node cloning (undoing CSE), or (iii) node spilling,
where we first choose non-loop nodes followed by loop nodes.

The first — node raising — pushes a node up to the next cut by adding serializ- 
ing edges from all other nodes in the cut. We repeat this until either the liveness 
width is less than the number of physical registers, or there is only one node left 
in the cut. 

In node cloning, we take a node and generate copies (clones). Serializing
edges are added to maintain the DFR of the clones. A simple algorithm for
this transformation is to produce as many clones as there are dependents of the
node; a node recombining pass will recombine clones that end up in the same
cut.

Node cloning is not always applicable as it may increase the liveness width
of higher cuts (when the in-registers of the cloned node did not previously pass
through the cut); placing a cloned node in a lower cut can increase the liveness
width. But, used properly [18], node cloning can reduce the liveness width of
lower cuts by splitting the live range of a node, which potentially has a lower
cost than spilling.

Finally, when all other transformations are unable to satisfy the constraint, 
we must spill one or more edges to memory. PT edges are ideal candidates for 
spilling, as the lifetime of the edge affords good pipeline behaviour on superscalar 
RISC-like targets; likewise, UPT edges are similarly beneficial, but place some 
constraints on the location of the store node. 

A related issue is the spilling of θ-nodes. As discussed previously, the worst-case
cost of spilling a loop-variant variable from a θ-node tuple is two stores
and three loads, so these should always be done after spilling of PT nodes.
By contrast, Chaitin/Chow colouring has to use approximate cost heuristics to
decide to spill a variable in a loop or outside.

6 Related Work 

6.1 Benefits over Other Program Graph Representations 

The VSDG is based in part on the Value Dependence Graph (VDG) [21]. The
VDG uses a λ-node to represent both functions and loop bodies, thereby combining
loops and functions into one complex abstraction mechanism. In
the VSDG we treat them separately with call and θ-nodes. One particular
problem the VDG has is that of preserving the terminating properties of a
program: “Evaluation of the VDG may terminate even if the original program
would not...” [21].

Another significant issue with the VDG is the process of generating target
code from the VDG. The authors describe converting the VDG into a demand-based
Program Dependence Graph (dPDG), a normal Program Dependence
Graph |2] with additional edges representing demand dependence, then converting
that into a traditional control flow graph (CFG) [1] before finally generating
target code from the CFG with a standard back-end code generator; this is not
as flexible (or as clearly specified) as the VSDG presented in this paper.

Many other program graphs (with many and varied edge forms) have been 
presented in the literature: the Program Dependence Graph [H], the Program 
Dependence Web |2], the System Dependence Graph m and the Dependence 
Flow Graph m- Our VSDG is both simpler — only two types of edge represent 
all of the above flow information — and more normalizing f fe.dj) . 

6.2 Solving Phase Order Problems 

The traditional view of register allocation as a graph colouring problem was
proposed by Chaitin [5]. In §4.2 we generalise both the Chaitin and Chow
approaches.

Goodwin and Wilken m formulate global register allocation (including all 
possible spill placements) as a 0-1 integer programming problem. While they do 
achieve quite impressive results, the cost is very high: the complexity of their
algorithm is O(n^) and for a given time period the allocator does not guarantee 
to allocate all functions. 

Code motion as an optimization is not new (e.g. Partial Redundancy Elimination El).
Perhaps the work closest in spirit to ours is that of Rüthing et
al. [18], which presents algorithms for optimal placement of expressions and
subexpressions, combining both raising and lowering of code within basic blocks.

Most work has concentrated on the instruction scheduling/register allocation 
phase order problem, which we now consider. 

The CRAIG framework [4], implemented within the ROCKET compiler m,
takes a brute force approach: 

1. attempt register allocation after instruction scheduling, 

2. if the schedule cost is not acceptable (by some defined metric) attempt reg- 
ister allocation before scheduling, 

3. then while the cost is acceptable (i.e. there is some better schedule) add
back in information from the first pass until the schedule just becomes too 
costly. 

Their experience with an instance of CRAIG ($CRAIG_0$) defines the metric as
the existence of spill code. Their experimental results show improvements in
execution time, but do not document the change in code size.

Rao improves on $CRAIG_0$ with additional heuristics to allow some
spilling, where it can be shown that spilling has a beneficial effect.

Touati’s thesis [20] argues that register allocation is the primary determinant
of performance, not scheduling. The goal of his thesis is again to minimize the 
insertion of spill-code, both through careful analysis of register pressure, and 
by adding serializing edges to each basic block data dependency DAG. It is 
basic-block-based. 

An early attempt at combining register allocation with instruction scheduling
was proposed by Pinter [16]. That work is based on an instruction level
register-based intermediate code, and is preceded by a phase to determine data
dependencies. This dependence information then drives the allocator, generating
a Parallelizable Interference Graph to suggest possible register allocations. Further,
the Global Scheduling Graph is then used to schedule instructions within a
region.

Another region-based approach is that of Janssen and Corporaal where 
regions correspond to the bodies of natural loops. They then use this hierarchy 
of nested regions to focus register allocation, with the inner-most regions being 
favoured by better register allocation (i.e. less spill code).

The Resource Spackling Framework of Berson et al. [3] applies a Measure
and Reduce paradigm to combine both phases — their approach first measures 
the resource requirements of a program using a unified representation, and then 
moves instructions out of excessive sets into resource holes. This approach is 
basic-block-based: a local scheduler attempts to satisfy the target constraints 
without increasing the execution time of a block; the more complicated global 
scheduler moves instructions between blocks. 

7 Conclusions and Further Work

In this paper we have defined the VSDG, an enhanced form of the VDG which includes
state dependency edges to model sequentialized computation. By adding
sufficient state-dependency edges we have shown that the VSDG is able to represent
a single CFG; conversely, fewer serializing edges relax the artificial constraints
imposed by the CFG.

From this basis, we have shown that the VSDG framework supports a combined
approach to register allocation and code motion, using an incremental
algorithm which effectively interleaves the two phases, thus avoiding the
well-known phase-ordering problem. We have described an algorithm which,
given a well-formed, normalized VSDG, allocates registers, if necessary
interleaving this with code motion, node splitting and register spilling.

The work presented here is the start of a larger project: an implementation 
of the algorithms in this paper is in progress. 

Abstract. Register allocation in loops is generally performed after or during
the software pipelining process. This is because doing a conventional register
allocation as a first step, without assuming a schedule, lacks the information about
interferences between variable lifetime intervals. Thus, the register allocator may
introduce an excessive amount of false dependences that dramatically reduce the
ILP (Instruction Level Parallelism). We present a new framework for controlling
the register pressure before software pipelining. It is based on inserting some
anti-dependence edges (register reuse edges) labeled with reuse distances,
directly on the data dependence graph. In this new graph, we are able to guarantee
that the number of simultaneously alive variables in any schedule does not exceed
a limit. The determination of register and distance reuse is parameterized by the
desired critical circuit ratio (MII) as well as by the register pressure constraints:
either can be minimized while the other one is fixed. After scheduling, register
allocation is done cyclically on conventional register sets or on rotating register files.
We give an optimal exact model, and an approximate one that generalizes
the Ning-Gao buffer optimization heuristics.



1 Introduction 

This article addresses the problem of register pressure in simple loop data dependence
graphs (DDGs), with multiple register types and operations with non-unit assumed latencies.
Our aim is to decouple the register constraints and allocation from the scheduling
process and to analyze the trade-off between memory (register pressure) and parallelism
constraints, measured as the critical ratio MII^ of the DDG.

The principal reason is that we believe that register allocation is more important as
an optimization issue than code scheduling. This is because code performance is far
more sensitive to memory accesses than to fine-grain scheduling (the memory gap): a cache
miss may inhibit the processor from achieving a high dynamic ILP, even if the scheduler
has extracted it at compile time. Even if one would expect spill code to exhibit
high locality, and hence to be likely to produce cache hits, we cannot assert it at compile
time. The authors in [6^1 report that about 66% of application execution time is spent
satisfying memory requests.

Another reason for handling register constraints prior to ILP scheduling is that register 
constraints are much more complex than resource constraints. Scheduling under resource 

^ We refer here to $MII_{dep}$ since we will not consider any resource constraints.

G. Hedin (Ed.): CC 2003, LNCS 2622, pp. 17-32, 2003.

© Springer-Verlag Berlin Heidelberg 2003






S.-A.-A. Touati and C. Eisenbeis 



constraints is a performance issue. Given a DDG, we are sure to find at least one valid
schedule for any underlying hardware properties (a sequential schedule in the extreme case,
i.e., no ILP). However, scheduling a DDG with a limited number of registers is more
complex. We cannot guarantee the existence of at least one schedule. In some cases, we
must introduce spill code and hence we change the problem (the input DDG). Also, a
combined pass of scheduling with register allocation presents an important drawback if
not enough registers are available. During scheduling, we may need to insert load-store
operations. We cannot guarantee the existence of a valid issue time for these introduced
memory accesses in an already scheduled code; resource or data dependence constraints
may prevent us from finding a valid issue slot inside an already scheduled code. This forces
us to iteratively apply scheduling followed by spilling until reaching a solution.

All the above arguments make us re-think new ways of handling register pressure 
before starting the scheduling process, so that the scheduler would be free from register 
constraints and would not suffer from excessive serializations. 

Existing techniques in this field usually apply register allocation after a step of 
software pipelining that is sensitive to register requirement. Indeed, if we succeed 
in building a software pipelined schedule that does not produce more than R values 
simultaneously alive, then we can build a cyclic register allocation with R available 
registers B3I14I1 . We can use either loop unrolling (3, inserting move operations lEl, or a
hardware rotating register file when available lTl4rl . Therefore, a great amount of work 
tries to schedule a loop such that it does not use more than R values simultaneously 
alive [9, 23, 13, 15, 12, 5, 16, 7, 10]. In this paper we directly work on the loop DDG and
modify it in order to satisfy the register constraints for any subsequent software
pipelining pass. This idea is already present in |(T] for DAGs and uses the concept of
reuse edge or vector developed in [18, 19].

Our article is organized as follows. Sect. 2 defines our loop model and a generic ILP
processor. Sect. 3 starts the study with a simple example. The problem of cyclic register
allocation is described in Sect. 4 and formulated with integer linear programming (intLP).
The special case where a rotating register file (RRF) exists in the underlying processor 
is discussed in Sect. 0 In Sectia we present a polynomial subproblem. Finally, we 
synthesize our experiments in Sect, dbefore concluding. 

2 Loop Model 

We consider a simple innermost loop (without branches). It is represented by a graph
$G = (V, E, \delta, \lambda)$, such that V is the set of the statements in the loop body and E is
the set of precedence constraints (flow dependences, or other serial constraints). We
associate to each arc $e \in E$ a latency $\delta(e)$ in terms of processor clock cycles and a
distance $\lambda(e)$ in terms of number of iterations. We denote by $u(i)$ the instance of the
statement $u \in V$ of the iteration i. A valid schedule $\sigma$ must satisfy:

$\forall e = (u, v) \in E : \sigma(u(i)) + \delta(e) \le \sigma(v(i + \lambda(e)))$

^ Insertion of move operations or using a rotating register file requires $R+1$ registers at most 01.

We consider a target RISC-style architecture with multiple register types, where T
denotes the set of register types (for instance, $T = \{int, float\}$). We distinguish
between statements and precedence constraints, depending on whether they refer to values to be
stored in registers or not. $V_{R,t}$ is the set of values to be stored in registers of type $t \in T$.
We consider that each statement $u \in V$ writes into at most one register of a type $t \in T$.
Statements which define multiple values with different types are accepted in our
model if they do not define more than one value of a certain type. $E_{R,t}$ is the set of flow
dependence edges through a value of type $t \in T$. The set of consumers (readers) of a
value $u^t$ (the value of type t defined by u) is then the set:

$Cons(u^t) = \{ v \in V \mid (u, v) \in E_{R,t} \}$

To consider static issue VLIW and EPIC/IA64 processors in which the hardware
pipeline steps are visible to compilers (we consider dynamically scheduled superscalar
processors too), we assume that reading from and writing into a register may be delayed
from the beginning of the schedule time, and that these delays are visible to the compiler
(architecturally visible). We define two delay (offset) functions $\delta_{r,t}$ and $\delta_{w,t}$ in which:
the read cycle of $u^t$ from a register of type t is $\sigma(u) + \delta_{r,t}(u)$, and the write cycle of $u^t$
into a register of type t is $\sigma(u) + \delta_{w,t}(u)$.

For superscalar and EPIC/IA64 processors, $\delta_{r,t}$ and $\delta_{w,t}$ are equal to zero.

A software pipelining is a function $\sigma$ that assigns to each statement u a scheduling
date (in terms of clock cycle) that satisfies at least the precedence constraints. It is defined
by an initiation interval, noted $II$, and the scheduling date $\sigma_u$ for the operations of the
first iteration. The operation $u(i)$ of iteration i is scheduled at time $\sigma_u + (i - 1) \times II$.
For every edge $e = (u, v) \in E$, this periodic schedule must satisfy:

$\sigma_u + \delta(e) \le \sigma_v + \lambda(e) \cdot II$

Classically, by adding all such inequalities on any circuit C of G, we find that $II$
must be greater than or equal to $\max_C \frac{\sum_{e \in C} \delta(e)}{\sum_{e \in C} \lambda(e)}$, commonly denoted $MII$
(minimal initiation interval).
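
The summation step can be spelled out as a short derivation sketch. It writes $\sigma_u$ for the scheduling date of u, $\delta(e)$ for the latency and $\lambda(e)$ for the distance of an edge, and assumes an elementary circuit C with $\sum_{e \in C} \lambda(e) > 0$. Each statement of the circuit is the source of exactly one edge and the target of exactly one edge, so the scheduling dates cancel:

```latex
\sum_{e=(u,v)\in C} \bigl(\sigma_u + \delta(e)\bigr)
  \le \sum_{e=(u,v)\in C} \bigl(\sigma_v + \lambda(e)\cdot II\bigr)
\;\Longrightarrow\;
\sum_{e\in C}\delta(e) \le II \cdot \sum_{e\in C}\lambda(e)
\;\Longrightarrow\;
II \ge \frac{\sum_{e\in C}\delta(e)}{\sum_{e\in C}\lambda(e)}
```

Taking the maximum of the right-hand side over all circuits gives the bound on the initiation interval.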

We now consider a number of available registers $\rho$ and all the schedules that have
no more than $\rho$ simultaneously alive variables. Any actual subsequent register allocation
will induce new dependences in the DDG. Hence, register pressure has an influence on the
expected $II$, even if we assume unbounded resources. What we want to analyze here is
the minimum $II$ that can be expected for any schedule using no more than $\rho$ registers. We
will denote this value as $MII(\rho)$ and we will try to understand the relationship between
$MII(\rho)$ and $\rho$. Let us start with an example to fix the ideas.

3 Basic Ideas 

We now give more intuition for the new edges that we add between pairs of operations.
These edges represent possible reuse by the second operation of the register released by 
the first operation. This can be viewed as a variant of m or Em. 

Let us consider a simple loop that consists of a unique flow dependence from u to v
with distance $\lambda = 3$ (see Fig. 1(a), where values to be stored in registers of the considered
type are in bold circles, and flows are in bold edges). If we have an unbounded number
of registers, all iterations of this loop can be run in parallel since there is no recurrence
circuit in the DDG. At each iteration, operation u writes into a new register. Now, let
us assume that we only have $\rho = 5$ available registers. The different instances of u
can use only $\rho = 5$ registers to cyclically carry their results. In this case, the operation
$u(i + \rho)$ writes into the same register previously used by $u(i)$. This fact creates an
anti-dependence from $v(i + \lambda)$, which reads the value defined by $u(i)$, to $u(i + \rho)$; this means
an anti-dependence in the DDG from v to u with a distance $\rho - \lambda = 2$. Since u actually
writes into its destination register $\delta_{w,t}(u)$ clock cycles after it is issued and v reads it
$\delta_{r,t}(v)$ clock cycles after it is issued, the latency of this anti-dependence is set to
$\delta_{r,t}(v) - \delta_{w,t}(u)$ for
VLIW or EPIC codes, and to 1 for superscalar (sequential) codes. Consequently, the DDG
becomes cyclic because of storage limitations (see Fig. 1(b), where the anti-dependence
is dashed). The introduced anti-dependence, also called “Universal Occupancy Vector”
(UOV) in [H, must in turn be counted when computing the new minimum initiation
interval since a new circuit is created.

Fig. 1. Simple Example: (a) Simple DDG, with the flow edge from u to v labelled $(\delta, \lambda)$; (b) Antidependence, with the edge from v to u labelled $(\delta_{r,t}(v) - \delta_{w,t}(u), \rho - \lambda)$
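
The effect of the new circuit on the minimum initiation interval can be checked numerically with a small brute-force sketch. The latencies used here (4 cycles for the flow edge, 1 cycle for the anti-dependence) are hypothetical values chosen for illustration, the function names are not from the paper, and circuit enumeration of this kind is only practical for tiny graphs.

```python
from fractions import Fraction


def critical_ratio(edges):
    """Largest sum(latency) / sum(distance) over the simple circuits of a DDG.

    edges: list of (src, dst, latency, distance). Brute-force enumeration,
    intended for tiny graphs only. Returns None when the graph is acyclic.
    """
    succ = {}
    for u, v, lat, dist in edges:
        succ.setdefault(u, []).append((v, lat, dist))

    best = None

    def extend(start, node, lat_sum, dist_sum, visited):
        nonlocal best
        for v, lat, dist in succ.get(node, []):
            if v == start:
                total = dist_sum + dist
                if total <= 0:
                    raise ValueError("circuit with non-positive distance")
                ratio = Fraction(lat_sum + lat, total)
                if best is None or ratio > best:
                    best = ratio
            elif v not in visited and v > start:
                # Only grow circuits from their smallest node to avoid repeats.
                extend(start, v, lat_sum + lat, dist_sum + dist, visited | {v})

    for start in succ:
        extend(start, start, 0, 0, {start})
    return best


def two_node_loop(rho=None, flow_latency=4, lam=3, anti_latency=1):
    """Flow edge u -> v, plus the edge v -> u when rho registers are imposed."""
    edges = [("u", "v", flow_latency, lam)]
    if rho is not None:
        edges.append(("v", "u", anti_latency, rho - lam))
    return edges


unbounded = critical_ratio(two_node_loop())
with_five = critical_ratio(two_node_loop(rho=5))
with_four = critical_ratio(two_node_loop(rho=4))
```

With these assumed latencies, removing one register shortens the distance of the anti-dependence and raises the critical ratio, which is the trade-off between register pressure and parallelism that the section describes.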

When an operation defines a value that is read by more than one operation, we cannot
know in advance which of these consumers actually kills the value (which is the last
reader), and hence we cannot know in advance when a register is freed. We propose a
trick which defines for each value $u^t$ of type t a fictitious killing task $k_{u^t}$. We insert an
edge from each consumer $v \in Cons(u^t)$ to $k_{u^t}$ to reflect the fact that this killing task
is scheduled after the last scheduled consumer (see Fig. 2). The latency of this serial
edge is set to $\delta_{r,t}(v)$ because of the reading delay, and we set its distance to $-\lambda$, where
$\lambda$ is the distance of the flow dependence between u and its consumer v. This is done to
model the fact that the operation $k_{u^t}(i + \lambda - \lambda)$, i.e., $k_{u^t}(i)$, is scheduled when the value
$u^t(i)$ is killed. The iteration number i of the killer of $u(i)$ is only a convention and can
be changed by retiming fTTIl . without changing the nature of the problem.

Now, a register allocation scheme consists of defining the edges and the distances
of reuse. That is, we define for each $u(i)$ the operation v and iteration distance $\mu_{u,v}$ such that
$v(i + \mu_{u,v})$ reuses the same destination register as $u(i)$. This reuse creates a new
anti-dependence from $k_u$ to v with latency equal to $-\delta_{w,t}(v)$ for VLIW or EPIC codes, and
to 1 for sequential superscalar codes. The distance $\mu_{u,v}$ of this edge has to be defined.
We will see in a further section that the register requirement can be expressed in terms
of $\mu_{u,v}$.



Hence, controlling register requirement means, first, determining which operation
should reuse the register killed by another operation (where should anti-dependences
be added?). Secondly, we have to determine variable lifetimes, or equivalently register
requirement (how many iterations later ($\mu$) should reuse occur?). The lower $\mu$ is, the
lower the register requirement, but also the larger the $MII$.

Fig. 2(a) presents a first reuse decision where each statement reuses the register
freed by itself. This is illustrated by adding an anti-dependence from $k_u$ (resp. $k_v$) to
u (resp. v) with an appropriate distance $\mu$, as we will see later. Another reuse decision
(see Fig. 2(b)) may be that the statement u (resp. v) reuses the register freed by v (resp.
u). This is illustrated by adding an anti-dependence from $k_u$ (resp. $k_v$) to v (resp. u). In
both cases, the register requirement is $\mu_1 + \mu_2$, but it is easy to see that the two schemes
do not have the same impact on $MII$: intuitively it is better that the operations share
registers instead of using two different pools of registers. The next section gives a formal
definition of the problem and provides an exact formulation.

4 Problem Description 

4.1 Data Dependences and Reuse Edges 

The reuse relation between the values (variables) is described by defining a new graph
called a reuse graph that we note $G_r = (V_{R,t}, E_r, \mu)$. Fig. 2(a) shows the first reuse
decision where u (v resp.) reuses the register used by itself $\mu_1$ ($\mu_2$ resp.) iterations
earlier. Fig. 2(b) is the second reuse choice where u (v resp.) reuses the register used by
v (u resp.) $\mu_1$ ($\mu_2$ resp.) iterations earlier. Each edge $e = (u, v) \in E_r$ with a distance
$\mu(e)$ in the reuse graph means that there is an anti-dependence between $k_u$ and v with a
distance $\mu(e)$. The resulting DDG after adding the killing tasks and the anti-dependences
to apply the register reuse decisions is called the DDG associated with a reuse decision:
Fig. 2(a) is the DDG associated with Fig. 3(a), and Fig. 2(b) is the one associated with
Fig. 3(b). We denote by $G_{\to r}$ the DDG associated with a reuse decision r.
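
The construction of the associated DDG can be sketched in a few lines. This is an illustrative reading of the description above for a single register type under the VLIW/EPIC latency convention; the function name, the `("kill", u)` encoding of killing tasks and all numeric values are assumptions of the example.

```python
def associated_ddg(flow_edges, reuse, read_delay, write_delay):
    """Build the edge list of the DDG associated with a reuse decision.

    flow_edges:  list of (u, v, latency, distance), flow dependences through
                 the value defined by u
    reuse:       {u: (v, mu)}, meaning v reuses the register of u, mu
                 iterations later
    read_delay, write_delay: {operation: delay in cycles}
    Killing tasks are represented by the tuple ("kill", u).
    """
    edges = list(flow_edges)
    for u, v, _latency, distance in flow_edges:
        # Consumer -> killing task: latency is the read delay, distance -lambda.
        edges.append((v, ("kill", u), read_delay[v], -distance))
    for u, (v, mu) in reuse.items():
        # Killing task -> reusing operation (VLIW/EPIC latency), distance mu.
        edges.append((("kill", u), v, -write_delay[v], mu))
    return edges


# Hypothetical numbers: one flow edge u -> v with latency 4 and distance 3;
# u reuses its own register 5 iterations later.
ddg = associated_ddg(
    flow_edges=[("u", "v", 4, 3)],
    reuse={"u": ("u", 5)},
    read_delay={"u": 0, "v": 0},
    write_delay={"u": 2, "v": 0},
)
circuit_distance = sum(distance for _, _, _, distance in ddg)
circuit_latency = sum(latency for _, _, latency, _ in ddg)
```

In this example the three edges form a single circuit whose total distance equals the reuse distance, since the distance of the flow edge and that of the edge into the killing task cancel.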




Fig. 2. Killing Tasks: (a) First Reuse Decision; (b) Another Allocation Scheme

Fig. 3. Reuse Graphs: (a) First Reuse Graph; (b) Second Reuse Graph



A reuse graph must satisfy some constraints to be valid: first, the resulting DDG
must be schedulable; second, each value reuses only one freed register, and each register 
is reused by only one value. The second constraint means that the reuse scheme is the 
same at each iteration. Generalizing this condition by allowing different (but periodic) 
reuse schemes is beyond the scope of this paper. This condition results in the following 
lemma. 

Lemma 1. [22] Let $G_r = (V_{R,t}, E_r, \mu)$ be a valid reuse graph of type t associated with
a loop $G = (V, E, \delta, \lambda)$. Then:

- the reuse graph only consists of elementary and disjoint circuits;

- any value $u^t \in V_{R,t}$ belongs to a unique circuit in the reuse graph.

Any circuit C in a reuse graph is called a reuse circuit. We note $\mu(C)$ the sum of
the $\mu$ distances in this circuit. Then, to each reuse circuit $C = (u_0, u_1, \ldots, u_n, u_0)$, there
exists an image $C' = (u_0, \ldots, k_{u_0}, u_1, \ldots, k_{u_1}, \ldots, u_n, \ldots, k_{u_n}, u_0)$ for it in the associated DDG.
For instance in Fig. 2(a), $C' = (v, v_1, k_v, v)$ is an image for the reuse circuit $C = (v, v)$
in Fig. 3(a). Such an image may not be unique.

If a reuse graph is valid, we can build a cyclic register allocation in the DDG associated
with it, as explained in the following theorem. We require $\mu_t(G_r)$ registers, in
which $\mu_t(G_r)$ is the sum of all $\mu$ distances in the reuse graph $G_r$.

Theorem 1. [22] Let $G = (V, E, \delta, \lambda)$ be a loop and $G_r = (V_{R,t}, E_r, \mu)$ a valid reuse
graph of a register type $t \in T$. Then the reuse graph $G_r$ defines a cyclic register
allocation for G with exactly $\mu_t(G_r)$ registers of type t if we unroll the loop $\alpha$ times,
where:

$\alpha = \mathrm{lcm}(\mu_t(C_1), \ldots, \mu_t(C_n))$

with $\mathcal{C} = \{C_1, \ldots, C_n\}$ the set of all reuse circuits, and lcm the least common
multiple.

For a complete and detailed proof, please refer to [22].

As a corollary, we can build a cyclic register allocation for all register types.

Corollary 1. [22] Let $G = (V, E, \delta, \lambda)$ be a loop with a set of register types T. To each
type $t \in T$ is associated a valid reuse graph $G_{r_t}$. The loop can be allocated with $\mu_t(G_{r_t})$
registers for each type t if we unroll it $\alpha$ times, where

$\alpha = \mathrm{lcm}(\alpha_{t_1}, \ldots, \alpha_{t_n})$

where $\alpha_{t_i}$ is the unrolling degree of the reuse graph of type $t_i$.
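
To make the register count and the unrolling degree concrete, the following sketch splits a reuse graph into its circuits and applies the lcm rule for a single register type. The dictionary encoding of the reuse graph and the function names are assumptions of the example, and the reuse distances are hypothetical.

```python
from functools import reduce
from math import gcd


def reuse_circuits(reuse):
    """Split a reuse graph into its circuits.

    reuse: {u: (v, mu)}, meaning v reuses the register of u, mu iterations later.
    Returns a list of (nodes, mu_sum), one entry per circuit.
    """
    seen = set()
    circuits = []
    for start in reuse:
        if start in seen:
            continue
        nodes, mu_sum, node = [], 0, start
        while node not in seen:
            seen.add(node)
            nodes.append(node)
            node, mu = reuse[node]
            mu_sum += mu
        if node != start:
            raise ValueError("reuse relation is not a bijection")
        circuits.append((nodes, mu_sum))
    return circuits


def registers_and_unrolling(reuse):
    """Return (number of registers, unrolling degree) for one register type."""
    sums = [mu_sum for _, mu_sum in reuse_circuits(reuse)]
    registers = sum(sums)
    unroll = reduce(lambda a, b: a * b // gcd(a, b), sums, 1)
    return registers, unroll


# Each statement reuses its own register (two circuits) ...
own_register = registers_and_unrolling({"u": ("u", 2), "v": ("v", 3)})
# ... versus u and v reusing each other's register (one circuit).
shared = registers_and_unrolling({"u": ("v", 2), "v": ("u", 3)})
```

Both reuse decisions need the same number of registers here, but they differ in the unrolling degree, because the first has two circuits with coprime distance sums while the second has a single circuit.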
It should be noted that the fact that the unrolling factor may be significantly high
is not related to our method and would happen only if we actually want to allocate the 
variables on this minimal number of registers with the computed reuse scheme. However, 
there may be other reuse schemes for the same number of registers, or there may be other 
available registers in the architecture. In that case, the meeting graph framework II can 
help to control or reduce this unrolling factor. 

From all above, we deduce a formal definition of the problem of optimal cyclic 
register allocation with minimal ILP loss. We call it Schedule Independent Register 
Allocation (SIRA). 

Problem 1 (SIRA). Let $G = (V, E, \delta, \lambda)$ be a loop and $R_t$ the number of available registers
of type t. Find a valid reuse graph for each register type such that the corresponding
$\mu_t(G_r) \le R_t$
and the critical circuit in G is minimized.

This problem can be reduced to the classical NP-complete problem of minimal register
allocation fTH . The following section gives an exact formulation of SIRA.



4.2 Exact Formulation 

In this section, we give an intLP model for solving SIRA. It is built for a fixed execution
rate II (the new constrained MII). Note that II is not the initiation interval of the final
schedule, since the loop is not yet scheduled: II denotes the value of the new desired
critical circuit. Here, we assume VLIW or EPIC codes. For superscalar ones, we only
have to set the anti-dependence latency to 1.

Our SIRA exact model uses the linear formulation of the logical implication (=^) 
and equivalence (<^=^) by introducing binary variables, as previously explained in (20l 
Em. The size of our system is bounded by $O(|V|^2)$ variables and $O(|E| + |V|^2)$
linear constraints.

Basic Variables

- a schedule variable $\sigma_u \ge 0$ for each operation $u \in V$, including one for each
  killing node $k_{u^t}$;

- a binary variable $\theta^t_{u,v}$ for each $(u, v) \in V_{R,t}^2$ and each register type
  $t \in T$. It is set to 1 iff $(u, v)$ is a reuse edge of type $t$;

- a reuse distance $\mu^t_{u,v}$ for all $(u, v) \in V_{R,t}^2$ and for each register type.

Linear Constraints

- data dependences (the existence of at least one valid software pipelining schedule,
  including killing tasks constraints):

  $$\forall e = (u, v) \in E : \sigma_u + \delta(e) \le \sigma_v + II \times \lambda(e)$$

- there is an anti-dependence between $k_{u^t}$ and $v$ if $(u, v)$ is a reuse edge:

  $$\forall t \in T, \forall (u, v) \in V_{R,t}^2 : \theta^t_{u,v} = 1 \Rightarrow \sigma_{k_{u^t}} - \delta_w(v) \le \sigma_v + II \times \mu^t_{u,v}$$

- If there is no register reuse between two values ($reuse_t(u) \ne v$), then
  $\theta^t_{u,v} = 0$. The anti-dependence distance $\mu^t_{u,v}$ must be set to 0 in order
  not to be cumulated in the objective function:

  $$\forall t \in T, \forall (u, v) \in V_{R,t}^2 : \theta^t_{u,v} = 0 \Rightarrow \mu^t_{u,v} = 0$$

The reuse relation must be a bijection from $V_{R,t}$ to $V_{R,t}$:

- a register can be reused by only one operation:

  $$\forall t \in T, \forall u \in V_{R,t} : \sum_{v \in V_{R,t}} \theta^t_{u,v} = 1$$

- one value can reuse only one released register:

  $$\forall t \in T, \forall u \in V_{R,t} : \sum_{v \in V_{R,t}} \theta^t_{v,u} = 1$$

Objective Function. We want to minimize the number of registers required for the
register allocation. So, we choose an arbitrary register type t which we use as objective
function:

$$\text{Minimize} \sum_{(u,v) \in V_{R,t}^2} \mu^t_{u,v}$$

The other register types are bounded in the model by their respective number of
available registers:

$$\forall t' \in T - \{t\} : \sum_{(u,v) \in V_{R,t'}^2} \mu^{t'}_{u,v} \le \mathcal{R}_{t'}$$
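The reuse constraints of the model can be checked on a candidate solution without any
solver. The sketch below is only an illustration for a single register type; the function
name and the dictionary encoding of the binary reuse variables (`theta`) and of the
distances (`mu`) are assumptions, and the scheduling constraints are deliberately left out.

```python
def check_reuse_solution(values, theta, mu, available):
    """Check the bijection, zero-distance and register-bound constraints.

    values    -- the values written into registers of this type
    theta     -- theta[(u, v)] is 1 iff (u, v) is a reuse edge, else 0
    mu        -- mu[(u, v)] is the (non-negative) reuse distance
    available -- number of available registers of this type
    """
    for u in values:
        # The register released by u is reused by exactly one value.
        if sum(theta[(u, v)] for v in values) != 1:
            return False
        # u reuses exactly one released register.
        if sum(theta[(v, u)] for v in values) != 1:
            return False
    # Distances of non-reuse pairs must not count in the objective.
    if any(theta[e] == 0 and mu[e] != 0 for e in theta):
        return False
    return sum(mu.values()) <= available


# Hypothetical two-value example: a and b exchange their registers.
values = ["a", "b"]
theta = {("a", "a"): 0, ("a", "b"): 1, ("b", "a"): 1, ("b", "b"): 0}
mu = {("a", "a"): 0, ("a", "b"): 1, ("b", "a"): 1, ("b", "b"): 0}
print(check_reuse_solution(values, theta, mu, available=2))
```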

As previously mentioned, our model includes writing and reading offsets. The
non-positive latencies of the introduced anti-dependences generate a specific problem.
Indeed, some circuits C in the constructed DDG may have a non-positive distance
$\lambda(C) \le 0$. Even if such circuits do not prevent a DDG from being scheduled, they
may do so in the presence of resource constraints. Thus, we prohibit such circuits (we
will discuss it later).
Note that this problem does not occur for superscalar (sequential) codes, because the 
introduced anti-dependences have positive latencies. 

The unrolling degree is left free and is not controlled in the SIRA formulation. This
factor may theoretically grow exponentially. Minimizing the unrolling degree amounts to
minimizing $\operatorname{lcm}(\mu_i)$, the least common multiple of the anti-dependence
distances of reuse circuits. This non-linear problem is very difficult and remains open:
as far as we know, there is no satisfactory solution for it. Fortunately, there exists a
hardware feature that avoids loop unrolling. We study it in the next section.
5 Rotating Register Files

A rotating register file [4, 14, 17] is a hardware feature that implicitly moves (shifts)
architectural registers in a cyclic way. At every new kernel issue (special branch
operation), each architectural register specified by the program is mapped by hardware to
a new physical register. The mapping function is ($R$ denotes an architectural register
and $R'$ a physical register): $R_i \mapsto R'_{(i + RRB) \bmod s}$, where RRB is a rotating
register base and s the total number of physical registers. The number of that physical
register is decremented continuously at each new kernel. Consequently, the intrinsic reuse
scheme between statements necessarily describes a hamiltonian reuse circuit. The hardware
behavior of such register files does not allow other reuse patterns. SIRA in this case must
be adapted in order to look only for hamiltonian reuse circuits.
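The mapping described above can be mimicked in a few lines. This is a toy model, not a
description of any particular processor: the class name, its methods, and the choice to
decrement the rotating register base by one at each kernel issue are assumptions made for
the illustration.

```python
class RotatingRegisterFile:
    """Toy model: architectural register i maps to physical (i + RRB) mod s."""

    def __init__(self, size):
        self.size = size              # s: total number of physical registers
        self.rrb = 0                  # rotating register base
        self.physical = [None] * size

    def _physical_index(self, i):
        return (i + self.rrb) % self.size

    def write(self, i, value):
        self.physical[self._physical_index(i)] = value

    def read(self, i):
        return self.physical[self._physical_index(i)]

    def new_kernel(self):
        # Assumed behavior of the special branch: RRB decremented modulo s.
        self.rrb = (self.rrb - 1) % self.size


rf = RotatingRegisterFile(4)
rf.write(0, "x")      # iteration k writes architectural register 0
rf.new_kernel()
print(rf.read(1))     # the same physical register is now named register 1
```

In this model a value written as register i is seen as register i + 1 one kernel later,
which is why the reuse pattern among statements forms a single circuit over all registers.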

Furthermore, even if no rotating register file exists, looking for only one hamiltonian 
reuse circuit makes the unrolling degree exactly equal to the number of allocated 
registers, and thus both are simultaneously minimized by the objective function. 

Since a reuse circuit is always elementary (Lemma 1), it is sufficient to state that
a hamiltonian reuse circuit with $n = |V_{R,t}|$ nodes is only a reuse circuit of size n. We
proceed by forcing an ordering of statements from 1 to n according to the reuse relation.
Thus, given a loop $G = (V, E, \delta, \lambda)$ and $G^r = (V_{R,t}, E_r, \mu)$ a valid reuse
graph of type $t \in T$, we define a hamiltonian ordering $ho_t$ as a function:

$$ho_t : V_{R,t} \to \mathbb{N}, \quad u \mapsto ho_t(u)$$

such that $\forall u, v \in V_{R,t}$:

$$(u, v) \in E_r \iff ho_t(v) = (ho_t(u) + 1) \bmod |V_{R,t}|$$

The existence of a hamiltonian ordering is a sufficient and necessary condition to
make the reuse graph hamiltonian, as stated in the following theorem.

Theorem 2. [22] Let $G = (V, E, \delta, \lambda)$ be a loop and $G^r$ a valid reuse graph.
There exists a hamiltonian ordering iff the reuse graph is a hamiltonian graph.

Hence, the problem of cyclic register allocation with minimal critical circuit on rotating
register files can be stated as follows.

Problem 2 (SIRA_HAM). Let $G = (V, E, \delta, \lambda)$ be a loop and $\mathcal{R}_t$ the
number of available registers of type $t$. Find a valid reuse graph with a hamiltonian
ordering $ho_t$ such that

$$\mu_t(G^r) \le \mathcal{R}_t$$

in which the critical circuit in G is minimized.

An exact formulation for it is deduced from the intLP model of SIRA. We have only
to add some constraints to compute a hamiltonian ordering. We expand the exact SIRA
intLP model by at most $O(|V|^2)$ variables and $O(|V|^2)$ linear constraints.
1. for each register type and for each value $u \in V_{R,t}$, we define an integer variable
   $ho_{u^t} \ge 0$ which corresponds to its hamiltonian ordering;

2. we add the linear constraints of the modulo hamiltonian ordering:
   $\forall u, v \in V_{R,t}^2$:

   $$\theta^t_{u,v} = 1 \iff ho_{u^t} + 1 = |V_{R,t}| \times \beta^t_{u,v} + ho_{v^t}$$

   where $\beta^t_{u,v}$ is a binary variable that holds the integer division of
   $ho_{u^t} + 1$ by $|V_{R,t}|$.

When looking for a hamiltonian reuse circuit, we may need one extra register to
construct such a circuit. In fact, this extra register virtually simulates moving values
among registers if circular lifetime intervals do not meet in a hamiltonian pattern.

Proposition 1. [22] Hamiltonian SIRA needs at most one more register than SIRA.

Both SIRA and hamiltonian SIRA are NP-complete. Fortunately, we have some
optimistic results. In the next section, we investigate the case in which SIRA can be
solved in polynomial time.

6 Fixing Reuse Edges 

In TTSi . Ning and Gao analyzed the problem of minimizing the buffer sizes in software 
pipelining. In our framework, this problem actually amounts to deciding that each oper- 
ation reuses the same register, possibly some iterations later. Therefore we consider now 
the complexity of our minimization problem when fixing reuse edges. This generalizes 
the Ning-Gao approach. Formally, the problem can be stated as follows. 

Problem 3 (Fixed SIRA). Let $G = (V, E, \delta, \lambda)$ be a loop and $\mathcal{R}_t$ the
number of available registers of type $t$. Let $E' \subseteq E$ be the set of already fixed
anti-dependence (reuse) edges of a register type $t$. Find a distance $\mu^t_u$ for each
anti-dependence $(k_{u^t}, v) \in E'$ such that

$$\mu_t(G^r) \le \mathcal{R}_t$$

in which the critical circuit in G is minimized.

In the following, we assume that $E' \subseteq E$ is the set of these already fixed
anti-dependence (reuse) edges (their distances have to be computed). Fixing the reuse
decisions at compile time greatly simplifies the intLP system of SIRA. It can be
solved by the following intLP, assuming a fixed desired critical circuit II. Here, we write
a system for VLIW or EPIC codes. For superscalar, we have to set the anti-dependence
latency to 1.



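$$
\begin{array}{ll}
\text{Minimize} & \rho = \sum_{u \in V_{R,t}} \mu^t_u \\
\text{Subject to:} & \\
& II \times \mu^t_u + \sigma_v - \sigma_{k_{u^t}} \ge -\delta_w(v) \quad \forall (k_{u^t}, v) \in E' \\
& \sigma_v - \sigma_u \ge \delta(e) - II \times \lambda(e) \quad \forall e = (u, v) \in E - E'
\end{array}
\qquad (1)
$$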
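Since II is a constant, we do the variable substitution $\mu'_u = II \times \mu^t_u$ and
System 1 becomes:

$$
\begin{array}{ll}
\text{Minimize} & (II \cdot \rho =) \sum_{u \in V_{R,t}} \mu'_u \\
\text{Subject to:} & \\
& \mu'_u + \sigma_v - \sigma_{k_{u^t}} \ge -\delta_w(v) \quad \forall (k_{u^t}, v) \in E' \\
& \sigma_v - \sigma_u \ge \delta(e) - II \times \lambda(e) \quad \forall e = (u, v) \in E - E'
\end{array}
\qquad (2)
$$

There are $O(|V|)$ variables and $O(|E|)$ linear constraints in this system.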

Theorem 3. [22] The constraint matrix of the integer programming model in System 2
is totally unimodular, i.e., the determinant of each square sub-matrix is equal to 0 or to
±1.

Consequently, we can use polynomial algorithms to solve this problem of finding the
minimal value for the product $II \cdot \rho$.

We must be aware that the back substitution $\mu = \mu' / II$ may produce a non-integral
value for the distance $\mu$. If we ceil it by setting $\mu = \lceil \mu' / II \rceil$, a
sub-optimal solution may result. It is easy to see that the loss in terms of number of
registers is not greater than the number of loop statements that write into a register
($|V_{R,t}|$). This algorithm generalizes

the heuristics proposed in 1TT]I . We think that we can avoid ceiling p by considering the 
already computed a variables, as done in [HTI . 
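The effect of ceiling the back-substituted distances can be seen on a small numeric
sketch. The values below are invented for the illustration (they are not experimental
data), and the function names are hypothetical.

```python
def back_substitute(mu_prime, ii):
    """Recover integer distances mu = ceil(mu' / II) with integer arithmetic."""
    return {u: -(-value // ii) for u, value in mu_prime.items()}


def register_loss(mu_prime, ii):
    """Registers used after ceiling, minus the fractional value sum(mu') / II."""
    mu = back_substitute(mu_prime, ii)
    return sum(mu.values()) - sum(mu_prime.values()) / ii


# Hypothetical System 2 solution for three values and II = 3.
mu_prime = {"a": 4, "b": 3, "c": 1}
mu = back_substitute(mu_prime, ii=3)
print(mu, sum(mu.values()), register_loss(mu_prime, ii=3))
```

Each ceiling adds strictly less than one register, so the total loss stays below the
number of values, in line with the bound stated above.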

Furthermore, solving System 2 has two interesting follow-ups. First, it gives a
polynomially computable lower bound for $MII_{rc}(\rho)$ as defined in the introduction,
for this reuse configuration rc. Let us denote as m the minimal value of the objective
function. Then

$$MII_{rc}(\rho) \ge \frac{m}{\rho}$$

This lower bound could be used in a heuristic in which the reuse scheme and the
number of available registers $\rho$ are fixed. Second, if II is fixed, then we obtain a
lower bound on the number of registers $\rho$ required in this reuse scheme rc, namely
$\rho \ge m / II$.



There are numerous choices for fixing reuse edges that can be used in practical 
compilers. 

1. For each value $u \in V_{R,t}$, we can decide that $reuse_t(u) = u$. This means that each
statement reuses the register freed by itself (no sharing of registers between different 
statements). This is similar to buffer minimization problem as described in [II. 

2. We can fix reuse edges according to the anti-dependences present in the original
code: if there is an anti-dependence between two statements u and v in the original
code, then fix $reuse_t(u') = v$ with the property that u kills u'. This decision is a
generalization to the problem of reducing the register requirement as studied in [23].

^ Of course, if we have MII = II = 1 (case of parallel loops for instance), the solution
remains optimal.
3. If a rotating register file is present, we can fix an arbitrary (or with a cleverer method)
hamiltonian reuse circuit among statements.

As explained before, our model includes writing and reading offsets. The negative 
latencies of the introduced anti-dependences generate a specific problem for VLIW 
codes. The next section solves this problem. 

Eliminating Non-positive Circuits. From the scheduling theory, circuits with non- 
positive distances do not prevent a DDG from being scheduled (if the latencies are 
non-positive too). But such circuits impose hard scheduling constraints that may not 
be satisfiable by resource constraints in the subsequent pass of instruction scheduling.
Therefore these circuits have to be forbidden. 

Alain Darte provides us a solution deduced from [2]. We add a quadratic number of
retiming constraints to avoid non-positive circuits. We define a retiming $r_e$ for each
edge $e \in E$. We have then a shift $r_e(u)$ for each node $u \in V$. We declare then an
integer $r_{e,u}$ for all $(e, u) \in (E \times V)$. Any retiming $r_e$ must satisfy the
following constraints:

$$
\begin{array}{ll}
\forall e' = (u', v') \ne e, & r_{e,v'} - r_{e,u'} + \lambda(e') \ge 0 \\
\text{for the edge } e = (u, v), & r_{e,v} - r_{e,u} + \lambda(e) \ge 1
\end{array}
\qquad (3)
$$

Note that an edge $e = (k_u, v) \in E'$ is an anti-dependence, i.e., its distance
$\lambda(e) = \mu_{u,v}$ is to be computed. Since we have $|E|$ distinct retiming functions,
we add $|E| \times |V|$ variables and $|E| \times |E|$ constraints. The constraint matrix is
totally unimodular, and it does not alter the total unimodularity of System 2. The
following lemma proves that satisfying System 3 is a necessary and sufficient condition
for building a DDG $G_{\to r}$ with positive circuit distances.

Lemma 2. [22] Let $G_{\to r}$ be the solution graph of System 1 or System 2. Then:

System 3 is satisfied $\iff$ any circuit in $G_{\to r}$ has a positive distance $\lambda(C) > 0$.
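Once all distances are known, the retiming condition can be tested directly: for a fixed
edge e, the constraints form a system of difference constraints, which is feasible iff the
associated constraint graph has no negative cycle. The sketch below uses a Bellman-Ford
style relaxation for that test. It is a stand-alone check under assumed data structures
(edge triples with known distances), not the intLP integration described in the text.

```python
def feasible(nodes, constraints):
    """Difference constraints x[a] - x[b] <= w, given as triples (a, b, w).

    The system is feasible iff relaxation reaches a fixed point within
    len(nodes) passes (no negative cycle).
    """
    dist = {n: 0 for n in nodes}
    for _ in range(len(nodes)):
        changed = False
        for a, b, w in constraints:
            if dist[b] + w < dist[a]:
                dist[a] = dist[b] + w
                changed = True
        if not changed:
            return True
    return False


def all_circuits_positive(nodes, edges):
    """edges: list of (u, v, distance). True iff a retiming exists for every edge."""
    for i in range(len(edges)):
        # r[v] - r[u] + distance >= 0, tightened to >= 1 for the selected edge i,
        # rewritten as r[u] - r[v] <= distance (minus 1 for edge i).
        constraints = [
            (u, v, lam - (1 if j == i else 0))
            for j, (u, v, lam) in enumerate(edges)
        ]
        if not feasible(nodes, constraints):
            return False
    return True


print(all_circuits_positive(["a", "b"], [("a", "b", 1), ("b", "a", 0)]))
print(all_circuits_positive(["a", "b"], [("a", "b", 1), ("b", "a", -1)]))
```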

The next section summarizes our experimental results. 

7 Experiments 

All the techniques described in this paper have been implemented and tested on various
numerical loops extracted from different benchmarks (Spec95, whetstone, livermore,
lin-ddot). This section presents a summary.

Optimal and Hamiltonian SIRA. In almost all the cases, the two techniques
need the same number of registers for the same II. However, as proved by
Prop. 1, hamiltonian SIRA may need one extra register, but in very few cases (about
5% of experiments). Regarding the resulting unrolling degrees, even if they may grow
exponentially with SIRA (from the theoretical perspective), experiments show that they are
mostly lower than the number of allocated registers, i.e., better than hamiltonian SIRA.
However, a few cases exhibit critical unrolling degrees which are not acceptable if
code size expansion is a critical factor.

^ This is because circuits with non-positive distances impose scheduling constraints of type “not 
later than”. 
Optimal SIRA versus Fixed SIRA. In a second step of experiments, we investigate the
fixed SIRA strategies (Sect. 6) to compare their results with the optimal ones (optimal
SIRA). We checked the efficiency of two strategies: the self reuse strategy (no register
sharing), and fixing an arbitrary hamiltonian reuse circuit. Solving the intLP systems
of these strategies becomes very fast compared to optimal solutions, as can be seen in the
first part of Fig. 4. We could not explore optimal solutions for loops larger than 10 nodes
because the computation time became intractable.




Fig. 4. Optimal versus Fixed SIRA with II = MII (legend: Optimal SIRA; Fixed SIRA,
System 1; Fixed SIRA, System 2)



For II = MII, some experiments do not exhibit a substantial difference. But if we
vary II from MII to an upper-bound L, the difference is highlighted as follows.

- Regarding the register requirement, the self reuse strategy is, in most cases, far from 
the optimal. Disabling register sharing needs a high number of registers, since each 
statement needs at least one register. However, enabling sharing with an arbitrary 
hamiltonian reuse circuit is much more beneficial. 

- Regarding the unrolling degree, the self reuse strategy exhibits the lowest ones, except
in very few cases.

Fixed SIRA: System 1 versus System 2. The compilation time for optimal SIRA becomes
intractable when the size of the loop exceeds 10 nodes. Hence, for larger loops, we
advise using our fixed SIRA strategies, which are faster but allow sub-optimal results. We
investigated the scalability (in terms of compilation time versus the size of DDGs) of
fixed SIRA when solving System 1 (non totally unimodular matrix) or System 2 (totally
unimodular matrix). Fig. 5 plots the compilation times for larger loops (buffers and
fixed hamiltonian). For loops larger than 300 nodes, the compilation time of System 1
becomes more considerable. The error ratio induced by ceiling the $\mu$ variable as solved
by System 2, compared to System 1, is depicted in Fig. 6. As can be seen, such an error
ratio asks us to improve the results of System 2 by re-optimizing the $\mu$ variables with
a cleverer method.





Fig. 5. Compilation Time versus the Size of the DDGs (two panels; x-axis: Number of Nodes)



8 Conclusion 

This article presents a new approach consisting in virtually building an early cyclic 
register allocation before code scheduling, with multiple register types and delays in 
reading/writing. This allocation is expressed in terms of reuse edges and reuse distances 
to model the fact that two statements use the same register as storage location. An intLP 
model gives an optimal solution with reduced constraint matrix size, and enables us to make
a tradeoff between ILP loss (increase of MII) and number of required registers.

The spilling problem is left for future work. We believe that it is important to take
it into consideration before instruction scheduling, and our framework should be very
convenient for that.

When considering VLIW and EPIC/IA64 processors with reading/writing delays, we
face some difficulties because of the possible non-positive distance circuits that
we prohibit. However, we allow anti-dependences to have non-positive latencies, because
this amounts to considering that the destination register is not alive during the execution
of the instruction and can be used for other variables. Since pipelined execution time is
increasing, this feature becomes crucial in VLIW and EPIC codes to reduce the register
requirement.

Each reuse decision implies loop unrolling with a factor depending on reuse circuits 
for each register type. The unrolling transformation can be applied before the software 

^ counted as the time for generating and solving the intLP systems, and building the allocated 
DDGs. 







Fig. 6. Error Ratio in Terms of Register Requirement, Induced by System 2, versus the Size
of the DDGs (two panels; x-axis: Number of Nodes)



pipelining pass (the inserted anti-dependences restrict the scheduler and satisfy register
constraints) or after it, during the code generation step. It is better to unroll the loop
after software pipelining in order not to increase the scheduling complexity under resource
constraints. Optimizing the unrolling factor is a hard problem and no satisfactory solution
exists so far. However, we do not need loop unrolling in the presence of a rotating
register file. We only need to seek a unique hamiltonian reuse circuit. The penalty for
this constraint is at most one more register than the optimal for the same II.

Since computing an optimal cyclic register allocation is intractable for real loops, we
have identified one polynomial subproblem by fixing reuse edges. With this polynomial
approach, we can compute $MII(\rho)$ for a given reuse configuration and a given register
count $\rho$. We can also heuristically find a register usage for one given II.

Our experiments show that disabling sharing of registers with a self reuse strategy
is not a good reuse decision in terms of register requirement. We think that how registers
are shared between different statements is one of the most important issues, and
preventing this sharing by a self reuse strategy consumes many more registers than needed
by other reuse decisions.
Abstract. We present new insights into properties of real-life interference
graphs emerging in register allocation. These new insights give
good hope of improving the coloring approach towards
optimal solutions. The conclusions are based on measurements of nearly
28,000 real instances of such interference graphs. All the instances ex- 
plored are determined to possess the so-called 1-perfectness property, a 
fact that seems to make them easy to color optimally. The exact algo- 
rithms presented not only produce better solutions than the traditional 
heuristic methods, but also, indeed, seem to perform surprisingly fast, 
according to the measurements on our implementations. 



1 Introduction 



For almost all architectures register allocation is among the most important of 
compiler optimizations. Computations involving only register operands are much 
faster than those involving memory operands. An effective utilization of the 
limited register file of the target machine may tremendously speed up program 
execution, compared to the same program compiled with a poor allocation. 

Graph coloring is an elegant approach to the register allocation problem. Tra- 
ditional algorithms used by compilers today IdjbfSIQIl'T] make use of approximate 
heuristics to accomplish the colorings. 

Here we do not propose a new algorithm for register allocation. Our exper- 
iments, however, suggest that such an algorithm may well be designed, which 
guarantees optimal colorings for the purpose of a good allocation. Despite the 
fact that graph coloring is an NP-complete problem, the input graphs in the case 
of register allocation certainly seem to be efficiently colored, even when using an 
exact algorithm. 



2 Background 

Let $V = \{v_1, v_2, v_3, \ldots\}$ be the set of variables in a given intermediate
representation (IR) of a program. Given a certain point p in the program flow, a variable
$v_i \in V$ is said to be live if it is defined above p but not yet used for the last time.
A live range (LR) for a variable $v_i \in V$ is a sequence of instructions beginning
with the definition of $v_i$ and ending with the last use of $v_i$. An LR interference

G. Hedin (Ed.): CC 2003, LNCS 2622, pp. 33–45, 2003.
is a pair $(v_i, v_j)$ of variables whose live ranges intersect. Variables involved in such
an interference cannot be assigned to the same register. We denote by E the
set of LR interference pairs. The register allocation problem is the problem of
finding a mapping $c : V \to \{r_1, r_2, \ldots, r_k\}$, where $r_i$ are the registers of the
target machine, such that $k \le N$, where N is the total number of registers available,
and such that $(v_i, v_j) \in E \Rightarrow c(v_i) \ne c(v_j)$. This corresponds closely to
the well-known Graph-Coloring problem of finding a k-coloring of the graph
$G = (V, E)$, which we call the interference graph (IG).



2.1 Graph Coloring 

More exactly, the Graph-Coloring problem is to determine for a given graph G
and a positive integer k whether there exists a proper k-coloring. The smallest
positive integer k for which a k-coloring exists is called the chromatic number
of G, which is denoted by $\chi(G)$. Graph-Coloring is NP-complete [8].

The coloring problem is not only practically impossible to solve exactly
in the general case: numerous works in this field from the past decades show that
it is also very hard to find algorithms that give good approximate solutions without
restricting the types of input graphs. One well-known and obvious lower bound
on the chromatic number $\chi(G)$ is the clique number, which is denoted by $\omega(G)$. A
clique Q in a graph $G = (V, E)$ is a subset of V such that the subgraph G' induced
by Q is complete, i.e., a graph in which all vertices are pairwise adjacent, and
hence have to be colored using no less than $|Q|$ colors. The Maximum-Clique
problem asks for the size of the largest clique of a given graph, the solution of
which is the clique number $\omega(G)$. There are, however, two problems with this
lower bound:

1. Maximum-Clique is (also) NP-complete [5].

2. According to, e.g., Kucera, the gap between the clique number $\omega$ and the
chromatic number $\chi$ is usually so large that $\omega$ seems not to be usable as a
lower bound on $\chi$ [15].



2.2 Traditional Approaches 



Since Graph-Coloring is NP-complete [8], traditional register allocation
implementations rely on an approximate greedy algorithm for accomplishing the
colorings. The technique used in all these implementations is based
on a simple coloring heuristic:

If $G = (V, E)$ contains a vertex v with a degree $\delta(v) < k$, i.e., with fewer
than k neighbors, then let G' be the graph $G - \{v\}$, obtained by removing v, i.e.,
the subgraph of G induced by $V \setminus \{v\}$. If G' can be colored, then so can G, for
when reinserting v into the colored graph G', the neighbors of v have at most
$k - 1$ colors among them. Hence a free color can always be found for v.

The reduction above is called the simplify pass. The vertices reduced from
the graph are temporarily pushed onto a stack. If, at some point during
simplification, the graph G has vertices only of significant degree, i.e., vertices v of
degree $\delta(v) \ge k$, then the heuristic fails and one vertex is marked for spilling.
That is, we choose one vertex in the graph and decide to represent it in memory,
not registers, during program execution. If, during a simplify pass, one or more
vertices are marked for spilling, the program must be rewritten with explicit
loads and stores, and new live ranges must be computed. Then the simplify pass
is repeated. This process iterates until simplify succeeds with no spills. Finally,
the graph is rebuilt, popping vertices from the stack and assigning colors to
them. This pass is called select.
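The simplify and select passes can be sketched as follows. This is a simplified
illustration, not the implementation of any cited allocator: the graph is an adjacency
dictionary, the spill candidate is chosen as the vertex of highest remaining degree (an
assumption), and a spilled vertex is simply dropped from the graph instead of rewriting
the program and recomputing live ranges.

```python
def simplify_select(graph, k):
    """Color `graph` (vertex -> set of neighbours) with at most k colors.

    Returns (coloring, spilled); colors are integers in range(k).
    """
    work = {v: set(ns) for v, ns in graph.items()}
    stack, spilled = [], []
    # Simplify: repeatedly remove a vertex of degree < k, or mark one for spilling.
    while work:
        low = [v for v in work if len(work[v]) < k]
        if low:
            v = low[0]
            stack.append(v)
        else:
            v = max(work, key=lambda x: len(work[x]))  # assumed spill choice
            spilled.append(v)
        for n in work.pop(v):
            work[n].discard(v)
    # Select: pop vertices and give each a color unused by its colored neighbours.
    coloring = {}
    while stack:
        v = stack.pop()
        used = {coloring[n] for n in graph[v] if n in coloring}
        coloring[v] = next(c for c in range(k) if c not in used)
    return coloring, spilled


# Hypothetical interference graph: a triangle a-b-c plus d adjacent to c.
g = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
coloring, spilled = simplify_select(g, k=3)
print(coloring, spilled)
```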

3 Interference Graph Characterization 

The traditional methods described above, which are used in register allocation 
algorithms today, are approximate. Our experiments show, however, that making 
optimal colorings using exponential algorithms, may actually be a possible way 
of coloring graphs in the register allocation case. The key to this conjecture is 
the claimed so-called 1-perfectness of interference graphs. 



3.1 Graph Perfectness 

In the study of the so-called Shannon capacity of graphs, László Lovász in the
1970s introduced the $\vartheta$-function, which has enjoyed great interest in the last
decades. For instance, its properties constitute the basis of a later proven
fact, that there are important instances of graphs (the so-called perfect graphs),
whose possible k-colorability can indeed be determined in polynomial time.

The 'd(G) function has two important and quite remarkable properties pT] : 

1. $\omega(G) \le \vartheta(\bar{G}) \le \chi(G)$ (The Sandwich Theorem)

2. For all graphs G, $\vartheta(G)$ is computable in polynomial time.

Those special instances of graphs G for which $\omega(G') = \chi(G')$ holds for each
induced subgraph G' are said to be perfect, and they are indeed perfect in that
particular sense that their possible k-colorability can be determined in polynomial
time, as a direct consequence of the above properties. There are, however,
no (or at least very few) proposals of algorithms which use this fact, and which 
run efficiently in practice. Moreover, the status of the recognition problem of the 
class of all perfect graphs is unknown. 

Despite the fact that nobody has succeeded in designing an algorithm that 
efficiently solves the polynomial problem of perfect graphs, the elegance of the 
theory of these special instances makes it interesting to explore the possible 
perfectness of the interference graphs occurring in register allocation. 

The complement graph of $G = (V, E)$ is the graph $\bar{G} = (V, \bar{E})$, where

$\bar{E} = \{e = (u, v) \mid u, v \in V, u \ne v, \text{ and } (u, v) \notin E\}$.

Furthermore, Olivier Coudert, who works in the field of logic synthesis and
verification, wrote in 1997 a very interesting paper [3] on the claimed simplicity
of coloring “real-life” graphs, i.e., graphs which occur in problem domains such
as VLSI CAD, scheduling, and resource binding and sharing. This simplicity is,
according to Coudert, basically a consequence of the fact that most of the graphs
investigated are 1-perfect, i.e., they are graphs such that $\omega(G) = \chi(G)$, however,
not necessarily for all subgraphs as in the case of perfect graphs.
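The difference between perfect and 1-perfect graphs can be made concrete with a
brute-force check on tiny graphs. The sketch below is exponential and only meant for
illustration; the function names and the edge-list encoding are assumptions.

```python
from itertools import combinations, product


def clique_number(vertices, edges):
    """Size of a largest clique, by exhaustive search."""
    adjacent = {frozenset(e) for e in edges}
    for size in range(len(vertices), 0, -1):
        for candidate in combinations(vertices, size):
            if all(frozenset(p) in adjacent for p in combinations(candidate, 2)):
                return size
    return 0


def chromatic_number(vertices, edges):
    """Smallest k admitting a proper k-coloring, by exhaustive search."""
    for k in range(1, len(vertices) + 1):
        for colors in product(range(k), repeat=len(vertices)):
            assignment = dict(zip(vertices, colors))
            if all(assignment[u] != assignment[v] for u, v in edges):
                return k
    return 0


def is_one_perfect(vertices, edges):
    return clique_number(vertices, edges) == chromatic_number(vertices, edges)


# A 5-cycle is not 1-perfect: clique number 2, chromatic number 3.
c5_vertices = [0, 1, 2, 3, 4]
c5_edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
# Adding a disjoint triangle makes the whole graph 1-perfect (both numbers are 3),
# although it is not perfect, since the induced 5-cycle is still there.
mixed_vertices = c5_vertices + ["x", "y", "z"]
mixed_edges = c5_edges + [("x", "y"), ("y", "z"), ("x", "z")]
print(is_one_perfect(c5_vertices, c5_edges), is_one_perfect(mixed_vertices, mixed_edges))
```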

4 Interference Graph Experiments 

The input graphs we have used for our metrics come from two sources: 

- Andrew W. Appel and Lal George have published a large set of interference
graphs [1] generated by their compiler for Standard ML of New Jersey,
compiling itself. The 27,921 actual LR interference graphs differ in size from
around 25 vertices and 200 edges up to graphs with 7,500 vertices and 70,000
edges. These graphs do not, however, constitute the data used in empirical
measurements reported by the authors in their articles on Iterated Register
Coalescing [9, 10].

- In a project task in the Lund University course on optimizing compilers,
an SSA-based experimental lab compiler for a subset of the C programming
language is provided, in which students are to implement optimization
algorithms. The best contribution in the fall 2000 course was provided by Per
Cederberg, PhD candidate at the Division of Robotics, Department of
Mechanical Engineering, Lund University. His implementation included, for
instance, algorithms for constant and copy propagation, dead code elimination,
global value numbering, loop-invariant code motion and the register
allocation algorithm proposed by George and Appel. Cederberg kindly let
us use his implementation for our experiments.

The interference graphs we have had access to indeed
seem to demonstrate some characteristics that point towards their potential
1-perfectness. For example, interference graphs tend not to be very dense, although
they have large cliques. In other words, the density of these graphs seems not to
be uniformly spread out over the whole graph, but rather localized to one or a
few “clique-like” parts. Such characteristics certainly suggest a possibility of
1-perfectness, and make it interesting to investigate whether Coudert's conjecture
can or cannot be confirmed for interference graphs.

In order to decide whether a graph G is 1-perfect, we need to solve two
NP-hard problems (as far as we know today), the Graph-Coloring problem
and the Maximum-Clique problem. Our only possibility is to implement exact
algorithms, i.e., an approximate solution to either of these problems is not
adequate.

^ http://www.cs.lth.se/Education/Courses/EDA230/

^ http://www.robotics.lu.se/
The algorithms for exactly solving both of these problems are fairly simple
and well-known. The simplicity of the algorithms does not, however, imply that
they are fast: both of them obviously have exponential worst case execution
time, since we may need to perform an exhaustive search to find the optimal
solution.



4.1 Sequential Coloring 

Algorithm 1 shows a backtracking search for an optimal coloring of an input
graph G. By initially searching for the maximum clique Q we get not only a
lower bound $\omega$ on the chromatic number $\chi$, but also a partial coloring of the
graph. (The vertices of Q are colored using $\omega$ colors, which is optimal since all
those vertices are pairwise adjacent in G.) If we are lucky, the graph is 1-perfect,
and when the recursive part of the algorithm finds a coloring using $\omega$ colors, the
search can be interrupted.



Algorithm 1 Create an optimal proper coloring of a graph G using a standard
backtracking search algorithm. Return the chromatic number χ(G), and leave
the coloring in a map color, indexed by vertices.

Sequential-Color(G)
    Q ← Maximum-Clique(G)
    k ← 0
    foreach v ∈ Q do
        k ← k + 1
        color[v] ← k
    return Sequential-Color-Recursive(G, k, |V(G)| + 1, |Q|)

Input: G is a graph, partially colored using k colors, χ is the current value on the
chromatic number and ω is the lower bound given by Maximum-Clique.

Sequential-Color-Recursive(G, k, χ, ω)
    if G is entirely colored then
        return k
    v ← an uncolored vertex of G
    foreach c ∈ [1, Min(k + 1, χ − 1)] do
        if ∀n ∈ N[v], color[n] ≠ c then          ▷ N[v] is the neighborhood of v
            color[v] ← c
            χ ← Sequential-Color-Recursive(G, Max(c, k), χ, ω)
            if χ = ω then                         ▷ 1-perfect graph
                return χ
    return χ                                      ▷ result after an exhaustive search



If, on the other hand, the graph is not 1-perfect, the maximum clique calculation
will not be of any help at all. The algorithm then has to exhaustively enumerate
all potential colorings that would improve on the chromatic number, which can
take exponential time. The problem is that the lower bound is static in the sense
that it is not reevaluated at each recursion. Moreover, the algorithm
simultaneously uses several unsaturated colors. (A color c is said to be saturated
if it cannot be used anymore to extend a partial coloring.) Efficiently estimating
a lower bound on the number of colors necessary to complete an unsaturated
coloring is an open problem.
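
The backtracking search of Algorithm 1 can be sketched in Python as follows.
This is a minimal illustration under stated assumptions, not the implementation
used in the experiments: the function names, the dictionary-of-sets graph
representation, and the brute-force clique helper are all invented for the
example. Unlike the pseudo-code, the sketch explicitly records the best coloring
found and undoes a color assignment when it backtracks.

```python
from itertools import combinations


def brute_force_max_clique(adj):
    """Return a maximum clique by exhaustive search (tiny graphs only)."""
    vertices = sorted(adj)
    for size in range(len(vertices), 0, -1):
        for cand in combinations(vertices, size):
            if all(b in adj[a] for a, b in combinations(cand, 2)):
                return list(cand)
    return []


def sequential_color(adj):
    """Return (chi, coloring) for a graph given as {vertex: set_of_neighbors}."""
    clique = brute_force_max_clique(adj)
    omega = len(clique)                      # lower bound on chi
    color = {v: i + 1 for i, v in enumerate(clique)}
    best = {"chi": len(adj) + 1, "coloring": None}

    def recurse(k):
        uncolored = [v for v in adj if v not in color]
        if not uncolored:                    # complete coloring with k colors
            best["chi"], best["coloring"] = k, dict(color)
            return
        v = uncolored[0]                     # no ordering heuristic here
        for c in range(1, k + 2):            # reuse a color or open color k + 1
            if max(c, k) >= best["chi"]:
                break                        # cannot improve on the best so far
            if all(color.get(n) != c for n in adj[v]):
                color[v] = c
                recurse(max(c, k))
                del color[v]                 # undo the choice when backtracking
                if best["chi"] == omega:
                    return                   # lower bound reached: optimal

    recurse(omega)
    return best["chi"], best["coloring"]


# A 5-cycle: clique number 2, chromatic number 3, hence not 1-perfect.
c5 = {0: {1, 4}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 0}}
chi, coloring = sequential_color(c5)
```

On the 5-cycle the early exit never fires, because no coloring with ω = 2 colors
exists; on a 1-perfect graph the search stops at the first coloring that uses
exactly ω colors.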

In [3] it is concluded that the maximum clique is tremendously important
when coloring 1-perfect graphs. If the clique found is not maximal, we are left 
with the same problem as when trying to color graphs which are not 1-perfect. 

One important part of the algorithm is left unspecified: in what order should
the uncolored vertices be picked? Kučera showed in 1991 that it is practically
impossible to find an ordering which performs well in the general case when
using a greedy approach to the coloring problem [15].

The ordering of the vertices in the exact sequential algorithm may, however,
have a strong impact on the efficiency of the search. Brélaz proposed in 1979
an efficient method called the DSATUR heuristic [2]. It consists of picking the
vertex that has the largest saturation number, i.e., the number of colors used by
its neighbors. If the algorithm is implemented carefully, the information on vertex
saturation can in fact be efficiently maintained, using so-called shrinking sets,
as shown by, for example, Turner.
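
The selection rule of the DSATUR heuristic can be illustrated with a short
Python sketch. The helper names and the tie-breaking by degree are assumptions
made for this example, and the saturation numbers are recomputed from scratch
on every call; the incremental bookkeeping with shrinking sets mentioned above
is not shown.

```python
def saturation(v, adj, color):
    """Number of distinct colors already used by the neighbors of v."""
    return len({color[n] for n in adj[v] if n in color})


def pick_dsatur_vertex(adj, color):
    """Pick the uncolored vertex with the largest saturation number.

    Ties are broken by degree, which is an assumption of this sketch.
    """
    uncolored = [v for v in adj if v not in color]
    return max(uncolored, key=lambda v: (saturation(v, adj, color), len(adj[v])))


adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}
color = {"a": 1, "b": 2}
first = pick_dsatur_vertex(adj, color)   # "c" sees two colors, "d" and "e" none
```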



4.2 Obtaining the Maximum Clique 

Algorithm 2 shows a simplified branch-and-bound approach to the
Maximum-Clique problem. The algorithm relies partially on the calculation of an
approximate coloring of the graph, which is used as an upper bound.

Algorithm 2 can be improved in a number of ways without jeopardizing
optimality of the computed clique [7]. Let G be the graph at some point of
the recursion, Q the clique under construction, Q̂ the current best solution, and
{I_1, ..., I_k} a k-coloring obtained on G. Then the following improvements apply:

— When we reach a state where |Q| + |V(G)| < |Q̂|, we can immediately prune
the search space, since it is impossible to find a larger clique.

— Every vertex v such that δ(v) < |Q̂| − |Q| must be removed from the graph,
because it cannot be a member of a larger clique.

— Every vertex v such that δ(v) > |V(G)| − 2 must be put into Q, since
excluding it cannot produce a larger clique.

— Every vertex v that can be colored with q colors, where q ≤ |Q̂| − |Q|,
yields an unsuccessful branch and can be left without further consideration.

The approximate coloring part of Maximum-Clique is a very simple greedy
algorithm, using no particular heuristic for the ordering of vertices. The graphs
have been represented simply as two-dimensional bit matrices. For the sake of
efficiency, the graphs ought to have been redundantly represented as adjacency
lists as well as matrices, a representation that has been chosen in register
allocators ever since Chaitin’s original algorithm. Our data structures are of
course easy to extend to this double representation.
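
The double representation mentioned above can be sketched as follows. The class
and method names are hypothetical, and Python integers stand in for the rows of
a bit matrix; the point is only that the matrix answers adjacency queries in
constant time while the redundant lists make neighborhood traversal cheap.

```python
class InterferenceGraph:
    """Undirected graph kept both as a bit matrix and as adjacency lists."""

    def __init__(self, n):
        self.n = n
        self.bits = [0] * n                  # row v is an int used as a bit set
        self.adj = [[] for _ in range(n)]    # redundant adjacency lists

    def has_edge(self, u, v):
        """Constant-time adjacency test on the bit matrix."""
        return bool((self.bits[u] >> v) & 1)

    def add_edge(self, u, v):
        """Insert an edge once, updating both representations."""
        if u != v and not self.has_edge(u, v):
            self.bits[u] |= 1 << v
            self.bits[v] |= 1 << u
            self.adj[u].append(v)
            self.adj[v].append(u)

    def neighbors(self, v):
        """Neighborhood lookup without scanning a whole matrix row."""
        return self.adj[v]


g = InterferenceGraph(4)
g.add_edge(0, 1)
g.add_edge(1, 2)
g.add_edge(0, 1)      # duplicate insertions are ignored
```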

Algorithm 2 Find the maximum clique of a graph G using a simplified
branch-and-bound technique. Return the set of vertices contained in the maximum
clique found.

Maximum-Clique(G)
  return Maximum-Clique-Recursive(G, ∅, ∅, ∞)

Input: G is the remaining part of the graph, Q is the clique under construction, Q̂ is
the largest clique found so far, and u is an upper bound on ω (size of the maximum
clique).

Maximum-Clique-Recursive(G, Q, Q̂, u)
  if G is empty then
    return Q
  {I_1, ..., I_k} ← Approximate-Color(G)
  u ← Min(u, |Q| + k)                       ▷ compute a new upper bound
  if u < |Q̂| then
    return Q̂
  v ← a maximum degree vertex of G
  G' ← subgraph induced by N[v]
  Q̂ ← Maximum-Clique-Recursive(G', Q ∪ {v}, Q̂, u)
  if u = |Q̂| then
    return Q̂
  G'' ← graph induced by V[G] − {v}
  return Maximum-Clique-Recursive(G'', Q, Q̂, u)
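
The idea behind Algorithm 2, bounding the clique size by the number of colors of
an approximate coloring, can be sketched in Python as below. This is an
illustration under assumptions, not the code used in the experiments: the names
are invented, the graph is a dictionary of neighbor sets, the best clique is kept
in an enclosing variable instead of being passed around, and a candidate clique
is only recorded when it is strictly larger than the best one so far.

```python
def greedy_color_count(vertices, adj):
    """Colors used by a plain greedy coloring of the subgraph on `vertices`."""
    color = {}
    for v in vertices:
        used = {color[n] for n in adj[v] if n in color}
        c = 1
        while c in used:
            c += 1
        color[v] = c
    return max(color.values(), default=0)


def maximum_clique(adj):
    """Branch-and-bound search for a maximum clique of {vertex: neighbors}."""
    best = []

    def recurse(candidates, clique):
        nonlocal best
        if not candidates:
            if len(clique) > len(best):
                best = list(clique)
            return
        # Any clique inside `candidates` needs one color per vertex.
        bound = len(clique) + greedy_color_count(candidates, adj)
        if bound <= len(best):
            return                            # cannot beat the best clique
        cand_set = set(candidates)
        v = max(candidates, key=lambda x: len(adj[x] & cand_set))
        # Branch 1: v joins the clique, so only its neighbors remain.
        recurse([n for n in candidates if n in adj[v]], clique + [v])
        # Branch 2: v is excluded from the clique.
        recurse([n for n in candidates if n != v], clique)

    recurse(list(adj), [])
    return best


# A triangle {0, 1, 2} with a path 2-3-4-5 attached.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
clique = maximum_clique(graph)
```

The further improvements listed in Section 4.2 (removing low-degree vertices,
forcing vertices adjacent to all others into the clique) are deliberately left
out to keep the sketch short.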



5 Experimental Results and Issues for Future Research 

In the experiments with the graphs, using the algorithms shown in Section 4, several
interesting observations have been made. First and very importantly, every single
graph investigated turned out to be 1-perfect; that is, for every single instance of
the almost 28,000 graphs investigated, the chromatic number was determined to
be exactly equal to the clique number.

The chromatic numbers of the Appel-George graphs range between 21 and 89.
Most of them (27,590 graphs) have χ = 21; 238 graphs have χ = 29. Other test
programs that we wrote and compiled with the Cederberg compiler yield
interference graphs with chromatic numbers of up to 15. All of them are
1-perfect. Despite numerous persistent tries, we have not managed to create
one single program that results in a non-1-perfect interference graph using our
compilers.

This experimental result raises two important questions to be further ex- 
plored: 

1. Are interference graphs always 1-perfect? Our experiments give strong
empirical evidence for this. If it is the case, we need to determine why. That
is, what in the earlier structural optimizations makes the graphs 1-perfect?
Further graph sets from different (kinds of) compilers need to be examined
in the future.

2. If all, or almost all, interference graphs are 1-perfect, how can we use this
fact to improve the efficiency of the existing register allocation
algorithms? Or should we rather incorporate qualities, such as simplification
and/or copy propagation by live range coalescing, from the approximate
algorithms into the exact algorithm?

The second question is partially answered by the second of our experimental
results. We noted when running the exact algorithms on the graphs that
they seemed to be surprisingly fast. Hence, the George-Appel allocator was also
implemented, using the published pseudo-code and the same data structures
for graph representation as in the exact algorithms. Repeatedly running this
algorithm for all the graphs, using a K-value equal to the chromatic number
determined by the exact algorithm, and comparing execution times to those of
the exact algorithm, gave the following result:

The exact algorithm for computing an optimal coloring is faster than
George-Appel's approximate iteration algorithm.

Of course, the George-Appel allocator suffers a penalty through our choice
of data structure: the authors recommend a combination of bit matrices and
adjacency lists.

But a change of the data structure would improve on the execution time of 
the exact algorithm as well. The operation for determining the neighborhood of 
a given vertex is expensive when using bit matrices, and it is very frequently 
used in both algorithms. 

The execution times for the two algorithms have been plotted as functions
of the sizes of the graphs in Fig. 1. One large and, for some reason, very tough
(though still 1-perfect) graph instance containing 6,827 vertices, 45,914 edges,
and 4,804 move-related instructions has been removed from the data. (The time
needed by the George-Appel algorithm to create the coloring was 3,290 seconds.
The exact algorithm needed 120 seconds.)

In order to show the difference in trend between the execution times of the two
algorithms, a second degree polynomial has been fitted to the samples using the
least-squares method. We do not, however, assert that the execution times are
quadratic in the sizes of the graphs; the exact algorithm is obviously exponential
in the worst case.

In Fig. 2 the same execution times are plotted for the 23,000 smallest graphs
only, excluding the few very large and extremely tough instances.

There is one more thing which is important to note in the comparison of
the two algorithmic approaches. In 46 of the 28,000 graphs, the George-Appel
algorithm fails to find the optimum, and spills one or two variables to memory.
This number of failures is actually impressively low, as the algorithm uses an
approximate, heuristic method for the NP-complete problem of coloring. Perhaps
the reason for the good performance of the approximate algorithm is the
1-perfectness of the graphs? Nevertheless, in comparison to the exact algorithm,
these spills are of course unfortunate, especially since it does not seem to take
longer to find optimal colorings of the graphs using the exact algorithm.

Fig. 1. Execution times for the two algorithms, coloring all graph instances. The dots
correspond to the exact algorithm; the pluses correspond to the George-Appel allocator.
The continuous functions are the best approximate second degree polynomials in the
least-squares sense. We do not, however, assert that the execution times are quadratic
in the sizes of the graphs. The intention is simply to compare the average execution
times of the algorithms for the graphs in question.



6 Conclusions 

Graph coloring is an elegant approach to the register allocation problem. The 
algorithms used by compilers today make use of approximate heuristics to ac- 
complish the colorings. 

In this paper, we do not propose a new algorithm for register allocation. The 
experiments, however, suggest that such an algorithm may well be designed, 
which guarantees optimal colorings for the purpose of a good allocation. Despite 
the fact that graph coloring is an NP-complete problem, the input graphs in 
the case of register allocation certainly seem to be efficiently colored, even when 
using an exact algorithm. 

In the implementation of the sequential coloring algorithm, none of the typical
improvements designed for register allocation, such as copy propagation by
coalescing, graph simplification by vertex removal/merging, or interference
reduction by live range splitting, have been accounted for. Our original purpose
was simply to determine whether the graphs were 1-perfect, since this could have
the effect of making optimal colorings efficiently computable.

Fig. 2. Execution times for the two algorithms, coloring the 23,000 smallest graphs.
The dots and the lower function estimate correspond to the exact algorithm; the pluses
and the upper estimate correspond to the George-Appel allocator. The functions are
second degree polynomials estimated from all the measured points using the
least-squares method.

In order to become applicable for register allocation, the algorithms need to 
implement such functionality. After all, most of the graphs investigated have 
much too large chromatic numbers, when not simplified, to fit into the register 
file of most processors. Even if the processor has a large register file, it is still 
desirable that programs do not use more registers than necessary, since loads 
and stores, made for instance at procedure calls, suffer most considerably when 
having to switch large numbers of registers into and out of memory. 

In order to complete the goal of improving the register allocation algorithms, 
some questions remain to be answered: 

— Do coalescing, merging, or splitting, the way we use these improvements in 
register allocators, jeopardize the 1-perfectness of the graphs? 

— Is the 1-perfectness of interference graphs provable?

— Can we perhaps further strengthen the constraints in order to restrict the 
graph classes towards perfectness? 

— How expensive does the exact coloring algorithm become if the graphs are
not 1-perfect?

— Is it at all possible to implement an efficient register allocator that contains 
the different graph simplifications, and still guarantees the optimality of the 
produced colorings? 

We believe that the answer to the last of these questions may well be positive, and 
our work will be continued with the goal of achieving such an implementation. 

A A Simple Example

Fig. 3 presents parts of a sample compiling session using the Cederberg compiler.
First, high-level source code is input to the front-end. The intermediate
representation (IR) produced by the front-end is presented to the right. For the
sake of simplicity, the user I/O functions, get and put, are assumed to be
implemented as processor instructions.

Temporary variables in the program are named t1, t2, t3, ...; basic blocks
are labeled x1, x2, x3, ....

The IR from the front-end is transformed into SSA form and is subject to two
optimizations, constant propagation with conditional branches and partial
redundancy elimination, the result of which is shown at the bottom left in
normal form, i.e., transformed back from SSA.

Analyzing the live ranges of the variables, inserting an edge between ranges
that interfere, gives the interference graph (IG) presented at the bottom right.
The graph has the maximum clique

Q = {t11, t16, t17, t21, t22}, ω = 5,

and we conclude, directly from the figure, that the chromatic number χ is no
larger than ω. Hence the graph is 1-perfect.

int f()
{
    int a;
    int b;
    int c;

    a = get();          /* from user */
    if (a < 0)
        return -a;
    b = 1024;
    c = b / 1024;

    while (b < a)
    {
        c = c + a*a + b + a;
        b = b - 1;
    }

    put(c);             /* to user */
    return 0;
}



x1:   get               t1
      mov   t1          a
      slt   a     0     t2
      bf    t2          x3
x2:   neg   a           t3
      ret   t3
x3:   mov   1024        b
      div   b     1024  t4
      mov   t4          c
      ba                x5
x4:   mul   a     a     t5
      add   c     t5    t6
      add   t6    b     t7
      add   t7    a     t8
      mov   t8          c
      sub   b     1     t9
      mov   t9          b
x5:   sgt   b     a     t10
      bt    t10         x4
x6:   put   c
      ret   0


Fig. 3. Top-left: High-level source code which is input to the compiler front-end.
Top-right: IR output from the front-end. Bottom-left: The final improved IR from the
optimizer. Bottom-right: IG with the maximum clique Q of size ω = 5 shown with
shaded vertices. Since, apparently, χ = ω, the graph is 1-perfect.

Abstract. This paper presents a semantics-based compilation model
for an aspect-oriented programming language based on its operational
semantics. Using partial evaluation, the model can explain several issues in
compilation processes, including how to find places in program text to
insert aspect code and how to remove unnecessary run-time checks. It also
illustrates optimization of calling-context sensitive pointcuts (cflow),
implemented in real compilers.



1 Introduction 

This work is part of a larger project, the Aspect SandBox (ASB), that aims to
provide concise models of aspect-oriented programming (AOP) for theoretical
studies and to provide a tool for prototyping alternative AOP semantics and
implementation techniques.

In this paper we report one result from the ASB project: an
operational-semantics based explanation of the compilation and optimization
strategy for AspectJ-like languages [5,9]. To avoid the difficulty of developing
a formal semantics directly from an artifact as complex as AspectJ, we used a
simplified language. It nevertheless has sufficient features to discuss the
compilation and optimization of real languages.

The idea is to use partial evaluation to perform as many tests as possible at 
compile-time, and to insert applicable advice bodies directly into the program. 
Our model also explains the optimization used by the AspectJ compiler for
calling-context sensitive pointcuts (cflow and cflowbelow).

Some of the issues our semantic model clarifies include: 

— The mapping between dynamic join points and the points in the program
text, or join point shadows, where the compiler actually operates.

— What dispatch can be ‘compiled-out’ and what must be done at runtime. 

— The performance impact different kinds of advice and pointcuts can have on 
a program. 

— How the compiler must handle recursive application of advice. 

* An early version of the paper was presented at FOAL 2002, Workshop on Founda- 
tions of Aspect-Oriented Languages at AOSD 2002. 

^ http://www.cs.ubc.ca/labs/spl/projects/asb.html

G. Hedin (Ed.): CC 2003, LNCS 2622, pp. 46–60, 2003.

1.1 Join Point Models

Aspect-oriented programming (AOP) is a paradigm to modularize crosscutting
concerns [10]. An AO program is effectively written in multiple modularities:
concerns that are local in one are diffuse in another and vice-versa. Thus far,
several AOP languages have been proposed [3,9,13,14].

The ability of an AOP language to support crosscutting lies in its join point 
model (JPM). A JPM consists of three elements: 

— The join points are the points of reference that programs including aspects
can affect. Lexical join points are locations in the program text (e.g., “the
body of a method”). Dynamic join points are run-time actions, such as events
that take place during execution of the program (e.g., “an invocation of a
method”).

— A means of identifying join points (e.g., “the bodies of methods in a
particular class,” or “all invocations of a particular method”).

— A means of effecting at join points (e.g., “run this code beforehand”).

In this paper, we will be working with a simplified JPM similar to the one
from AspectJ. (See Section 2.1 for details.)

The rest of the paper is organized as follows. Section 2 introduces our
simplified JPM, namely Pointcut and Advice (PA), and shows its interpreter.
Section 3 presents a compilation scheme for PA excluding context-sensitive
pointcuts, which are deferred to Section 4. Section 5 relates our study to other
formal studies in AOP and other compilation schemes. Section 6 concludes the
paper with future directions.

2 PA: Dynamic Join Point Model AOP Language 

This section introduces our small join point model, namely Pointcut and Advice
(PA), which implements core features of AspectJ’s dynamic join point model.
PA is modeled as an AOP extension to a simple object-oriented language. Its
operational semantics is given as an interpreter written in Scheme. A formalization
of a procedural subset of PA is presented by Wand, Kiczales and Dutchyn [19].



2.1 Informal Semantics 

We first informally present the semantics of PA. In short, PA is a dynamic join
point model that covers core features of AspectJ on top of a simple
object-oriented language with classes, objects, instance variables, and methods.



Object Semantics. Figure 1 is an example program. For readability, we use a
Java-like syntax in the paper. It defines a Point class with one integer instance
variable x, a unary constructor, and three methods set, move and main.

^ For simplicity later in the paper, we are using one-dimensional points as an example. 
^ Our implementation actually uses an S-expression based syntax. 

class Point {
  int x;
  Point(int ix) { this.set(ix); }
  void set(int newx) { this.x = newx; }
  void move(int dx) { this.set(this.x + dx); }
  void main() { Point p = new Point(1);
                p.move(5); write(p.x); newline(); } }



Fig. 1. An Example Program. (write and newline are primitive operators.)



When method main of a Point object is executed, it creates another Point 
object, and runs the constructor body. The main method then invokes method 
move on the created object, reads the value of variable x of the object and 
displays it. 



Aspect Semantics. To explain the semantics of AOP features, we first define 
the PA join point model. 

Join Point. The join point is an action during program execution, including 
method calls, method executions, object creations, and advice executions. (Note 
that a method invocation is treated as a call join point at the caller’s side and 
an execution join point at the receiver’s side.) The kind of the join point is the 
kind of action (e.g., call and execution).

Means of Identifying Join Points. The means of identifying join points is the
pointcut mechanism. A pointcut is a predicate on join points, which is used to
specify the join points that a piece of advice applies to. There are five kinds of
primitive pointcuts, namely call(m), execution(m), new(m), target(t v),
and args(t v, ...), three operators (&&, || and !), and two higher-order
pointcuts, namely cflow(p) and cflowbelow(p).

The first three primitive pointcuts (call, execution, and new) match join 
points that have the same kind and signature as the pointcut. The next two 
primitive pointcuts (target and args) match any join point that has values of 
specified types. The three operators logically combine or negate pointcuts. The 
last two higher-order pointcuts match join points that have a join point matching 
their sub-pointcuts in the call-stack. These are discussed in Section 4 in more
detail. Interpretation of pointcuts is formally presented in other literature [19].

Means of Effecting at Join Points. The means of effecting at join points is the
advice mechanism. A piece of advice contains a pointcut and a body expression.
When a join point is created, and it matches the pointcut of the advice, the advice
body is executed. There are two types of advice, namely before and after. A

^ For simplicity we omit around advice and after returning advice which can inspect 
return values. However, our experimental implementation actually supports those 
types of advice. 

(define eval
  (lambda (exp env jp)
    (cond ((const-exp? exp) (const-value exp))
          ((var-exp? exp) (lookup env (var-name exp)))
          ((call-exp? exp) (call (call-signature exp)
                                 (eval (call-target exp) env jp)
                                 (eval-rands (call-rands exp) env jp) jp))
          ...)))

(define call
  (lambda (sig obj args jp)
    (execute (lookup-method (object-class obj) sig) obj args jp)))

(define execute
  (lambda (method this args jp)
    (eval (method-body method)
          (new-env (append '(this %host) (method-params method))
                   (append (list this (method-class method)) args))
          jp)))



Fig. 2. Expression Interpreter. 



before advice runs before the original action takes place. Similarly, after
advice runs after the completion of the original action.

The following example advice definition makes the example program print a
message before every call to method set:

before : call(void Point.set(int)) && args(int z)
  { write("set:"); write(z); newline(); }

It consists of a keyword for the kind of the advice (before), a pointcut, and a 
body in braces. The pointcut matches join points that call method set of class 
Point, and the args sub-pointcut binds variable z to the argument to method 
set. The body of the advice prints messages and the value of the argument. 

When the Point program is executed together with the above advice, the
advice matches the call to set twice (in the constructor and in method move);
it thus prints “set:1”, “set:6” and “6”.
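
The behavior described above can be reproduced with a small Python simulation.
It is only a sketch of the informal semantics, not the PA interpreter: the
helper call, the before_advice list and the output list are names invented for
this example, and every method call is routed through call by hand so that a
call join point is visible to the advice.

```python
output = []          # collected instead of printed, for easy inspection
before_advice = []   # (class name, method name, advice body) triples


def call(obj, name, *args):
    """Call join point: run matching before advice, then the method itself."""
    for cls_name, meth_name, body in before_advice:
        if type(obj).__name__ == cls_name and name == meth_name:
            body(*args)
    return getattr(obj, name)(*args)


class Point:
    def __init__(self, ix):
        call(self, "set", ix)

    def set(self, newx):
        self.x = newx

    def move(self, dx):
        call(self, "set", self.x + dx)


def main():
    p = Point(1)
    call(p, "move", 5)
    output.append(str(p.x))


# before : call(void Point.set(int)) && args(int z) { write("set:"); write(z); }
before_advice.append(("Point", "set", lambda z: output.append(f"set:{z}")))
main()
```

After main returns, output holds the three strings in the order given in the
text: one for the call to set in the constructor, one for the call in move, and
the final value of x.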



2.2 Interpreter 

The interpreter consists of an expression interpreter and several definitions for 
AOP features including the data structure for a join point, wrappers for creating 
join points, a weaver, and a pointcut interpreter. 



Expression Interpreter. Figure 2 shows the core of the expression interpreter
excluding support for AOP features. The main function eval takes an expression, 
an environment, and a join point as its parameters. The join point is an execution 
join point at the enclosing method or constructor. 

An expression is a parsed abstract syntax tree, which can be tested with 
const-exp?, etc., and can be accessed with const-value, etc. An environment 

(define call
  (lambda (sig obj args jp)
    (weave (make-jp 'call sig obj args jp)
           (lambda (args jp) ...body of the original call...)
           args)))



Fig. 3. A Wrapped Interpreter Function. 



binds variables to mutable cells. An object is a Scheme data structure that has
class information and mutable fields for instance variables.

The body of eval is a simple case-based test on expression types. Some
operations are defined as separate functions for the later extension with AOP
features.

Join Point. A join point is a data structure that is created upon an action in
the expression interpreter:

(define-struct jp (kind name target args stack))

The kind field specifies the kind of the join point as a symbol (e.g., 'call). The
name field has the name of the method being called. The target and args fields
have the target object and the arguments of the method invocation, respectively.
The stack field will be explained in Section 4.



Wrapper. In order to advise actions performed in the expression interpreter, we
wrap the interpreter functions so that they create dynamic join points. Figure 3
shows how call, one such function, is wrapped. When a method is to be
called, the function first creates a join point that represents the call action and
passes it to weave, which executes advice applicable to the join point (explained
below). The lambda-closure passed to weave defines the action of the original
call, which is executed during the weaving process.

Likewise, the other functions including method execution, object creation,
and advice execution (defined later) are wrapped.



Weaver. Figure 4 shows the definition of the weaver. Function weave takes a
join point, a lambda-closure for continuing the original action, and a list of
arguments to the closure. It also uses advice definitions in global variables
(*befores* and *afters*). It defines the order of advice execution; it executes
befores first, then the original action, followed by afters last.

Function call-befores/afters processes a list of advice. It matches the
pointcut of each piece of advice against the current join point, and executes the
body of the advice if they match. In order to advise execution of advice, the

^ This non-standard Scheme construct defines a structure named jp with five fields 
named kind, name, target, args, and stack. 

(define weave
  (lambda (jp action args)
    (call-befores/afters *befores* args jp)
    (let ((result (action args jp)))
      (call-befores/afters *afters* args jp)
      result)))

(define call-befores/afters
  (lambda (advs args jp)
    (for-each (call-before/after args jp) advs)))

(define call-before/after
  (lambda (args jp)
    (lambda (adv)
      (let ((env (pointcut-match? (advice-pointcut adv) jp)))
        (if env (execute-before/after adv env jp))))))

(define execute-before/after
  (lambda (adv env jp)
    (weave (make-jp 'aexecution adv #f #f '() jp)
           (lambda (args jp) (eval (advice-body adv) env jp))
           '())))



Fig. 4. Weaver. 



(define pointcut-match?
  (lambda (pc jp)
    (cond ((and (call-pointcut? pc) (call-jp? jp)
                (sig-match? (pointcut-sig pc) (jp-name jp)))
           (make-env '() '()))
          ((and (args-pointcut? pc)
                (types-match? (jp-args jp) (pointcut-arg-types pc)))
           (make-env (pointcut-arg-names pc) (jp-args jp)))
          (else #f))))



Fig. 5. Pointcut Interpreter. 



function execute-before/after is also wrapped. The lambda-closure in the 
function actually executes the advice body. 
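
The control flow of the weaver in Figure 4 can be mirrored in Python as follows.
This is a sketch under assumptions rather than a translation of the actual
interpreter: join points are plain dictionaries, a pointcut is any function
returning an environment dictionary or None, and all names are chosen for this
example only.

```python
befores, afters = [], []      # stand-ins for *befores* and *afters*
trace = []                    # records the order in which things run


def weave(jp, action, args):
    """Run before advice, the original action, then after advice."""
    call_advice(befores, args, jp)
    result = action(args, jp)
    call_advice(afters, args, jp)
    return result


def call_advice(advs, args, jp):
    """Test every piece of advice against jp and execute those that match."""
    for pointcut, body in advs:
        env = pointcut(jp)            # an environment dict, or None if no match
        if env is not None:
            execute_advice(body, env, jp)


def execute_advice(body, env, jp):
    """Advice execution is itself woven, so that advice can be advised."""
    advice_jp = {"kind": "aexecution", "name": None, "args": [], "stack": jp}
    return weave(advice_jp, lambda args, inner_jp: body(env, inner_jp), [])


def is_call_to_set(jp):
    return {} if jp["kind"] == "call" and jp["name"] == "set" else None


def original_call(args, jp):
    trace.append("set")
    return args[0]


befores.append((is_call_to_set, lambda env, jp: trace.append("before")))
afters.append((is_call_to_set, lambda env, jp: trace.append("after")))

call_jp = {"kind": "call", "name": "set", "args": [6], "stack": None}
result = weave(call_jp, original_call, [6])
```

The nested call to weave inside execute_advice does not loop here because the
example pointcut only matches call join points; advice whose pointcut matched
its own execution would recurse indefinitely in this sketch.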

Calling around advice has basically the same structure as calling before and
after advice. It is, however, more complicated due to its interleaved execution
for the proceed mechanism.



Pointcut interpreter. The function pointcut-match? in Figure 5 matches a
pointcut to a join point. Due to space limitations, we only show rules for two
types of pointcuts. The first clause of the cond matches a call(m) pointcut to a
call join point whose name field matches m. It returns an empty
environment that represents ‘true’. The second clause matches an args(t x, ...)
pointcut to any join point whose args field has values of types t, .... The result
in this case is an environment that binds variables x, ... to the values in the
args field. The last clause returns false for unmatched cases.
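
The two clauses shown in Figure 5 can be paraphrased in Python as below. The
dictionary layout of pointcuts and join points is an assumption of this sketch.
One difference from Scheme is worth noting: an empty dictionary is falsy in
Python, so "no match" is represented by None and callers must test the result
with "is not None" rather than by truthiness.

```python
def pointcut_match(pc, jp):
    """Return an environment (dict) if pc matches jp, otherwise None."""
    if pc["kind"] == "call":
        if jp["kind"] == "call" and pc["sig"] == jp["name"]:
            return {}                          # empty environment means 'true'
        return None
    if pc["kind"] == "args":
        args, types = jp["args"], pc["types"]
        if len(args) == len(types) and all(
            isinstance(a, t) for a, t in zip(args, types)
        ):
            return dict(zip(pc["names"], args))  # bind names to argument values
        return None
    return None


jp = {"kind": "call", "name": "set", "args": [6]}
call_pc = {"kind": "call", "sig": "set"}
args_pc = {"kind": "args", "types": [int], "names": ["z"]}

env_call = pointcut_match(call_pc, jp)
env_args = pointcut_match(args_pc, jp)
```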

3 Compiling Programs by Partial Evaluation

Our compilation scheme is to partially evaluate an interpreter, which is known
as the first Futamura projection. Given an interpreter of a language and a
program to be interpreted, partial evaluation of the interpreter with respect to
the subject program generates a compiled program (called a residual program). By
following this scheme, partial evaluation of an AOP interpreter with respect to a
subject program and advice definitions would generate a compiled, or statically
woven program.

The effect of partial evaluation is the removal of unnecessary pointcut tests.
While the interpreter tests and executes all pieces of advice at each dynamic
join point, our compilation scheme inserts only applicable advice at
each shadow of join points. This is achieved in the following way:

1. Our compilation scheme partially evaluates the interpreter with respect to
each method definition.

2. The partial evaluator (PE) processes the expression interpreter, which
virtually walks over the expressions in the method. All shadows of join points
are thus instantiated.

3. At each shadow of join points, the PE further processes the weaver. Using
statically given advice definitions, it (conceptually) inserts a test-and-execute
sequence for all advice.

4. For each piece of advice, the PE reduces the test-and-execute code into a
conditional branch that has either a constant or dynamic value as its
condition, and the advice body as its then-clause. Depending on the condition,
the entire code or the test code may be removed.

5. The PE processes the code that executes the advice body. It thus instantiates
shadows of join points in the advice body. The steps from 3 onward recursively
compile ‘advised advice execution.’

We used PGG, an offline partial evaluator for Scheme, for partial
evaluation.



3.1 How the Interpreter Is Partially Evaluated 

An offline partial evaluator processes a program in the following way. It first
annotates expressions in the program as either static or dynamic, based on their
dependency on the statically known parameters. Those annotations are often
called binding-times. It then processes the program by actually evaluating static
expressions and by returning symbolic expressions for dynamic expressions. The
resulting program, which is called the residual program, consists of dynamic
expressions in which statically computed values are embedded.
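
The distinction between static and dynamic expressions can be illustrated with
the classic power function. The sketch below is hand-written for this example
and has nothing to do with PGG itself: specialize_power plays the role of the
partial evaluator for one fixed program, treating the exponent n as static and
the base x as dynamic, and it returns the residual program as Python source.

```python
def power(x, n):
    """The general program: both x and n are known only at run time."""
    result = 1
    for _ in range(n):
        result *= x
    return result


def specialize_power(n):
    """Return residual source for power with n static and x dynamic."""
    # The loop over n depends only on static data, so it is executed now
    # and unrolled; the multiplications depend on x and stay in the residue.
    body = " * ".join(["x"] * n) if n > 0 else "1"
    return f"def power_{n}(x):\n    return {body}\n"


residual = specialize_power(3)

# Load the residual program to check that it agrees with the original.
namespace = {}
exec(residual, namespace)
power_3 = namespace["power_3"]
```

The residual program contains no loop and no reference to n: everything that
depended only on the static input has been computed away.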

This subsection explains how the interpreter is partially evaluated with re- 
spect to a subject program, by emphasizing what operations can be performed 
at partial evaluation time. Although the partial evaluation is an automatic pro- 
cess, we believe understanding this process is crucially important for identifying 
compile-time information and also for developing better insights into the design of
hand-written compilers.

Compilation of Expressions. The essence of the first Futamura projection is
to evaluate away the computation involving exp. In fact, occurrences of exp in the
interpreter are annotated as static except for the first argument to execute in
function call. The argument is dynamic due to the nature of dynamic dispatching
in object-oriented languages. We therefore invoke the partial evaluator for
each method definition, and replace the function execute with one that
dynamically dispatches on a receiver’s type. This standard partial evaluation
technique is known as ‘The Trick.’

The environment (env) is regarded as a partially-static data structure; i.e., 
the variable names are static and the values are dynamic. As a result, the partial 
evaluator compiles variable accesses in the subject program into accesses to 
elements of the argument list in the residual code. 
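
A minimal sketch of this effect, in Python rather than Scheme: because the variable names are known at specialization time, the name lookup can be resolved to a position, and only an indexed access remains in the residual code. The helper names compile_var_access and compile_add, and the choice of Python source text as the residual language, are assumptions made for illustration. 

```python
def compile_var_access(name, static_names):
    """Resolve a variable name to a position at specialization time.

    The names are static, so the search happens here; the residual code
    only indexes the run-time argument list.
    """
    index = static_names.index(name)
    return f"args[{index}]"

def compile_add(left, right, static_names):
    """Residual code for the subject-program expression `left + right`."""
    return (f"({compile_var_access(left, static_names)}"
            f" + {compile_var_access(right, static_names)})")

# Static part of the environment: the names. Dynamic part: the values.
residual_src = compile_add("x", "dx", ["x", "dx"])
residual_fn = eval(f"lambda args: {residual_src}")

print(residual_src)         # (args[0] + args[1])
print(residual_fn([1, 5]))  # 6
```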

Compilation of Advice. As is mentioned at the beginning of the section, our 
compilation scheme inserts advice bodies into their applicable shadows of join 
points with appropriate guards. Below, we explain how this is done by the partial 
evaluator. 

1. A wrapper (e.g., Figure 3) creates a join point upon calling weave. The first 
two fields of the join point, namely kind and name, are static because they 
depend only on the program text. The remaining fields have values computed at 
run-time. We actually split the join point into two data structures so that 
static and dynamic fields are stored separately. With partial evaluators that 
support partially static data structures [1], we would get the same result 
without splitting the join point structure. 

2. Function weave (Figure 4) is executed with the static join point, an 
action, and dynamic arguments. Since the advice definitions are statically 
available, the partial evaluator unrolls the for-each in function 
eval-befores/afters. 

3. The result of pointcut-match? can be either static or dynamic depending 
on the type of the pointcut. Therefore, the test-and-execute sequence (in 
eval-before/after) becomes one of the following three: 

Statically false: No code is inserted into the compiled code. 

Statically true: The body of the advice is partially evaluated; i.e., the 
body is inserted into the compiled code. 

Dynamic: Partial evaluation of pointcut-match? generates an if expression 
with the body of the advice in the then-clause and an empty else-clause. 
Essentially, the advice body is inserted with a guard. 

4. In the statically true or dynamic cases of the above step, the partial 
evaluator processes the evaluation of the advice body. If the advice is 
applicable to more than one join point shadow in a method, the compiled body is 
shared as a Scheme function thanks to a mechanism in the partial evaluator. 
Since the wrapper of execute-before/after calls weave, application of advice to 
the advice body is also compiled. 

5. When the original action is evaluated, the residual code of the original 
action is inserted. This residual code from weave will thus have the original 
computation surrounded by applicable advice bodies. 

(define point-move 
  (lambda (this1 args2 jp3) 
    (let* ((jp4 (make-jp 'execution 'move this1 args2 jp3)) 
           (args5 (list (+ (get-field this1 'x) (car args2)))) 
           (jp6 (make-jp 'call 'set this1 args5 jp4))) 
      (if (type-match? args5 '(int)) 
          (begin (write "set:") (write (car args5)) (newline))) 
      (execute* (lookup-method (object-class this1) 'set) 
                this1 args5 jp6)))) 

Fig. 6. Compiled code of move method of Point class. 



Compilation of Pointcut. In step 3 above, pointcut-match? is partially 
evaluated with a static pointcut and the static fields of a join point. The 
partial evaluation process depends on the type of the pointcut. For pointcuts 
that depend only on static fields of a join point (e.g., call), the condition is 
statically computed to either an environment or false. For pointcuts that test 
values in the join point (e.g., target), the partial evaluator returns residual 
code that dynamically tests the types of the values in the join point. For 
example, when pointcut-match? is partially evaluated with respect to 
args(int x), the following expression is returned as the residual code: 

(if (types-match? (jp-args jp) '(int)) 
    (make-env '(x) (jp-args jp)) 
    #f) 

Logical operators (namely &&, || and !) are partially evaluated into an 
expression that combines the residual expressions of their sub-pointcuts. The 
remaining two pointcuts (cflow and cflowbelow) are discussed in the next section. 

The actual pointcut-match? is written in a continuation-passing style so 
that the partial evaluator can reduce a conditional branch in call-before/after 
for the static cases. This is a standard technique in partial evaluation, but it 
is crucially important for obtaining the right results. 
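
The three outcomes (statically false, statically true, dynamic) can be illustrated with a small sketch. It is written in Python rather than Scheme, uses direct style instead of the continuation-passing style of the actual pointcut-match?, and omits construction of the variable environment in the dynamic case. The tuple encoding of pointcuts and the names pe_match, compile_advice and types_match are assumptions made for illustration only. 

```python
def pe_match(pc, static_jp):
    """Partially evaluate a pointcut against the static (kind, name) fields.

    Returns False (statically false), a dict (statically true, with the
    environment known so far), or a str holding a residual run-time test.
    """
    tag = pc[0]
    if tag == "call":                  # depends only on static fields
        return {} if static_jp == ("call", pc[1]) else False
    if tag == "args":                  # depends on run-time argument values
        _, type_name, _var = pc
        return f"types_match(jp_args, ['{type_name}'])"
    if tag == "and":
        left = pe_match(pc[1], static_jp)
        if left is False:
            return False               # decided at specialization time
        right = pe_match(pc[2], static_jp)
        if right is False:
            return False
        if isinstance(left, dict) and isinstance(right, dict):
            return {**left, **right}
        if isinstance(left, dict):
            return right               # static side disappears from the test
        if isinstance(right, dict):
            return left
        return f"({left} and {right})"
    raise ValueError(f"unknown pointcut: {tag}")

def compile_advice(pc, body, static_jp):
    """Residual code inserted at one join point shadow for one piece of advice."""
    result = pe_match(pc, static_jp)
    if result is False:
        return ""                      # statically false: nothing inserted
    if isinstance(result, dict):
        return body                    # statically true: unguarded body
    return f"if {result}: {body}"      # dynamic: body behind a guard

advice_pc = ("and", ("call", "set"), ("args", "int", "x"))
advice_body = 'print("set:", jp_args[0])'
print(compile_advice(advice_pc, advice_body, ("call", "set")))
print(repr(compile_advice(advice_pc, advice_body, ("execution", "move"))))
```

At the call shadow only the args test survives as a guard, and at the execution shadow nothing is inserted, which mirrors the shape of the compiled code in Figure 6. 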

3.2 Compiled Code 

Figure 6 shows the compiled code for Point.move combined with the advice 
given in Section 2.1. For readability, we post-processed the residual code by 
eliminating dead code, propagating constants, renaming variables, combining 
split join point structures, and so forth. The post-processing was done 
automatically except for renaming and combining. 

The compiled function first creates a join point jp4 for the method execution, 
a parameter list and a join point jp6 for the method call. The if expression 
is the advice body with a guard. The guard checks the residual condition for 
the args pointcut. (Note that no run-time checks are performed for the call 
pointcut.) If matched, the body of the advice is executed. Finally, the original 
action is performed. 

As we see, advice execution is successfully compiled. Even though there is 
a shadow of execution join points at the beginning of the method, no advice 
bodies are inserted in the compiled function as it does not match any advice. 

4 Compiling Calling-Context Sensitive Pointcuts 

As briefly mentioned before, cflow and cflowbelow pointcuts can investigate 
join points in the call stack; i.e., their truth value is sensitive to the 
calling context. Here, we first show a straightforward implementation that is 
based on a stack of join points. It is inefficient, however, and cannot be 
compiled properly. 

We then show an optimized implementation that can be found in the AspectJ 
compiler. The implementation exploits the incremental nature of those pointcuts, 
and is presented as a modified version of the PA interpreter. We can also see 
that those pointcuts can be properly compiled by using our compilation scheme. 

To keep the discussion simple, we only explain cflow in this section. Extending 
our idea to cflowbelow is easy and is actually done in our experimental system. 

4.1 Calling-Context Sensitive Pointcut: cflow 

A pointcut cflow(p) matches any join point that has a join point matching p 
in its call stack. The following definition is an example of advice 
that uses a cflow pointcut. The cflow pointcut matches join points that are 
created during method calls to move. When this pointcut matches a join point, 
the args(int w) sub-pointcut gets the parameter to move from the stack. 

after : call(void Point.set(int)) 
        && cflow(call(void Point.move(int)) && args(int w)) 
  { write("under move:"); write(w); newline(); } 

As a result, execution of the Point program with the two pieces of advice 
presented in Section 2.1 and above prints “set:1” first, “set:6” next, and then 
“under move:5” followed by “6” last. The call to set from the constructor is 
not advised by the advice using cflow. 

4.2 Stack-Based Implementation 

A straightforward implementation is to keep a stack of join points and to examine 
each join point in the stack from the top when cflow is evaluated. 

We use the stack field in a join point to maintain the stack. Whenever a 
new join point is created, we record the previous join point in the stack field 
(as the last argument to make-jp does, e.g., in Figure 6). Since join points are 
passed along method calls, the join points chained by the stack field from the 
current one form a stack of join points. Restoring old join points is implicitly 
achieved by merely using the original join point in the caller's continuation. 

The following definition shows the algorithm to interpret cflow(p), which 
simply runs down the stack until it finds a join point that matches p. If it 
reaches the bottom of the stack, the result is false. 

(define pointcut-match? 
  (lambda (pc jp) 
    (cond ((cflow-pointcut? pc) 
           (let loop ((jp jp)) 
             (and (not (bottom? jp)) 
                  (or (pointcut-match? (pointcut-body pc) jp) 
                      (loop (jp-stack jp)))))) 
          ...))) 

The problem with this implementation is run-time overhead. In order to 
manage the stack, we have to push a join point each time a new join point is 
created. Evaluation of cflow takes time linear in the stack depth in the worst 
case. When the cflow pointcuts in a program match only specific join points, 
keeping the other join points in the stack and testing them is a waste of time 
and space. 
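
The stack walk and its cost can be made concrete with a small runnable sketch. It is written in Python rather than Scheme and is not the PA interpreter: the JoinPoint class, the tuple encoding of pointcuts and the function name matches are assumptions made for illustration. 

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class JoinPoint:
    kind: str
    name: str
    stack: Optional["JoinPoint"] = None  # previous join point; None at the bottom

def matches(pc, jp):
    """Interpretive matching; a cflow pointcut walks the chain of join points."""
    tag = pc[0]
    if tag == "call":
        return jp.kind == "call" and jp.name == pc[1]
    if tag == "cflow":
        current = jp
        while current is not None:       # linear in the stack depth at worst
            if matches(pc[1], current):
                return True
            current = current.stack
        return False
    raise ValueError(f"unknown pointcut: {tag}")

# main calls move, which calls set; the constructor calls set directly.
main = JoinPoint("execution", "main")
move = JoinPoint("call", "move", main)
set_in_move = JoinPoint("call", "set", move)
set_in_ctor = JoinPoint("call", "set", main)

under_move = ("cflow", ("call", "move"))
print(matches(under_move, set_in_move))  # True
print(matches(under_move, set_in_ctor))  # False
```

Every join point is pushed and, on a failed match, every one of them is tested again, even though only calls to move can ever make the pointcut true. 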



4.3 State-Based Implementation 

A more optimized implementation of cflow, used in the AspectJ compiler, exploits 
its incremental nature. This idea can be explained by an example. Assume (as 
shown previously) that there is a pointcut “cflow(call(void Point.move(int)))” 
in a program. The pointcut becomes true once move is called. Then, until control 
returns from move (or another call to move takes place), the truth value of 
the pointcut is unchanged. This means that the system only needs to manage the 
state of each cflow(p) and update that state at the beginning and the end of 
join points that make p true. Note that the state should be managed by a stack 
because it has to be rewound to its previous state upon returning from actions. 
This state-based optimization can be explained in the following regards: 

- It avoids repeatedly matching cflow bodies to the same join point in the 
stack by evaluating the bodies of cflow upon creation of each join point, and 
recording the result. 

- It makes static evaluation (i.e., compilation) of cflow bodies possible 
because they are evaluated at each shadow of join points. As a result, 
management of a cflow state takes place only at shadows of join points matching 
the body of the cflow. 

- It evaluates a cflow pointcut in constant time by merely peeking at the top 
of a stack of states for each cflow pointcut. 

It is straightforward to implement this idea in the PA interpreter. Figure 7 
outlines the algorithm. Before running a subject program, the system collects all 
cflow pointcuts in the program, including those that appear inside other cflow 
pointcuts, and stores them in a global variable *cflow-pointcuts*. The system 
also gives unique identifiers to them, which are accessible via pointcut-id. We 
rename the last field of a join point from stack to state, so that it stores the 
current states of all cflow pointcuts. 



By having a pointer to the 'current' join point among the parameters of each 
function, the pop is done automatically on returning from the function. 



A Compilation and Optimization Model for Aspect- Oriented Programs 






(define weave 
  (lambda (jp action args) 
    (let ((new-jp (update-states *cflow-pointcuts* jp))) 
      ...the body of original weave...))) 

(define update-states 
  (lambda (pcs jp) 
    (fold (lambda (pc njp)   ;; fold: ('a*'b->'a)*'a*'b list->'a 
            (let ((env (pointcut-match? (pointcut-body pc) jp))) 
              (if env 
                  (update-state njp (pointcut-id pc) env) 
                  njp))) 
          jp pcs))) 

(define pointcut-match? 
  (lambda (pc jp) 
    (cond ((cflow-pointcut? pc) (lookup-state jp (pointcut-id pc))) 
          ...))) 

Fig. 7. State-based Implementation of cflow. (update-state jp id new-state) 
returns a copy of jp in which id's state is changed to new-state. (lookup-state 
jp id) returns the state of id in jp. 



When evaluation of an expression creates a join point, it first updates 
the states of all cflow pointcuts by wrapping weave by calling function 
update-states. The function update-states evaluates the body of each cflow 
pointcut, and updates the state only if the result is true. Otherwise, the state 
is unchanged. Therefore, after partial evaluation, the code for updating state is 
also eliminated when the body of the cflow is statically determined as false. 
The conditional case for cflow pointcuts in pointcut-match? merely looks up 
the state in the current join point. 
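
The following runnable sketch mirrors the state-based scheme. It is written in Python rather than Scheme and is not the code of Figure 7: the JoinPoint class, the helper match_body, the table CFLOW_POINTCUTS and the function make_jp are assumptions made for illustration, while update_state and lookup_state play the roles described in the caption of Figure 7. 

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class JoinPoint:
    kind: str
    name: str
    args: tuple = ()
    state: tuple = ()  # (cflow id, environment) pairs; never mutated in place

def lookup_state(jp, pc_id):
    """State of the cflow pointcut pc_id at jp; False if it is not active."""
    return dict(jp.state).get(pc_id, False)

def update_state(jp, pc_id, env):
    """Copy of jp in which the state of pc_id is changed to env."""
    new_state = dict(jp.state)
    new_state[pc_id] = env
    return replace(jp, state=tuple(new_state.items()))

def match_body(body, jp):
    """Match a body of the form ('call', name, var); bind var to the first argument."""
    _, name, var = body
    if jp.kind == "call" and jp.name == name:
        return {var: jp.args[0]}
    return False

# All cflow pointcuts of the program, collected before it runs.
CFLOW_POINTCUTS = [("_g1", ("call", "move", "w"))]

def make_jp(kind, name, args, parent):
    """Create a join point and update the cflow states (the role of update-states)."""
    inherited = parent.state if parent is not None else ()
    jp = JoinPoint(kind, name, args, inherited)
    for pc_id, body in CFLOW_POINTCUTS:
        env = match_body(body, jp)
        if env:                        # the state changes only on a match
            jp = update_state(jp, pc_id, env)
    return jp

main = make_jp("execution", "main", (), None)
move = make_jp("call", "move", (5,), main)
set_in_move = make_jp("call", "set", (6,), move)
set_in_ctor = make_jp("call", "set", (1,), main)

print(lookup_state(set_in_move, "_g1"))  # {'w': 5}
print(lookup_state(set_in_ctor, "_g1"))  # False
```

Because join points are immutable and each caller keeps its own join point, the previous state is restored simply by returning to the caller. The lookup no longer depends on the stack depth, only on the number of cflow pointcuts in the program. 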

Supporting cflowbelow pointcuts requires extending the state to a pair of 
states. We omit the details due to space limitations. 

Those two stack- and state-based implementations can also be understood as 
initial- and final-algebra representations [TB] etc.] of join points. The stack-based 
implementation defines a join point as the following data structure: 

(define-struct jp (kind name target args stack)) 

By noticing that the stack field of join points is accessed only for matching 
the join points to the cflow pointcuts in the program, the structure can take a 
final-algebra representation: 

(define-struct jp (kind name target args r1 r2 ... rn)) 

where ri is the result of pointcut-match? on the i'th cflow pointcut in the 
program. This is exactly what we have done for the state-based implementation. 

4.4 Compilation Result 

Figure 8 shows excerpts of compiled code for the Point program with the two 
advice definitions shown before. The compiler gave _g1 to the cflow pointcut 
as its identifier. 

(let* ((val7 ...create a point object...)   ; compiled code of p.move(5) 
       (args9 '(5)) 
       (jp8 (make-jp this1 args9 (jp-state jp3)))) 
  (if (types-match? args9 '(int)) 
      (begin (execute* (lookup-method (object-class val7) 'move) 
                       val7 args9 
                       (state-update jp8 '_g1 (new-env '(w) args9))) 
             ...write and newline...) 
      ...omitted...)) 

(define point-move   ; compiled code of Point.move 
  (lambda (this1 args2 jp3) 
    (let* ((args5 (list (+ (get-field this1 'x) (car args2)))) 
           (jp6 (make-jp this1 args5 (jp-state jp3)))) 
      (if (types-match? args5 '(int)) 
          (begin (write "set:") (write (car args5)) (newline) 
                 (let* ((val7 (execute* (lookup-method (object-class this1) 
                                                       'set) 
                                        this1 args5 jp6)) 
                        (env8 (state-lookup jp6 '_g1))) 
                   (if env8 (begin (write "under move:") 
                                   (write (lookup env8 'w)) (newline))) 
                   val7)) 
          ...omitted...)))) 

Fig. 8. Compiled code of p.move(5) and Point.main with cflow advice. 



The first expression corresponds to p.move(5); in Point.main. Since the 
method call to move makes the state of the cflow true, the compiled code 
updates the state of _g1 to an environment created by the args pointcut in the 
join point, and passes the updated join point to the method. 

The next function shows the compiled move method. The second if expression 
and the preceding state-lookup are for the advice using cflow. It evaluates 
the cflow pointcut by merely looking its state up, and runs the body of the 
advice if the pointcut is true. The value of variable w, which is bound by the 
args pointcut in cflow, is taken from the recorded state of the cflow pointcut. 
Since the state is updated when move is about to be called, it gives the 
argument value passed to the move method. 

To summarize, our scheme compiles a program with cflow pointcuts into one 
with state update operations at each join point that matches the sub-pointcut 
of a cflow pointcut, and state look-ups in the guards of advice bodies. By 
comparing the compiled code with the code generated by the AspectJ compiler, we 
observe that the two compilation frameworks insert update operations for the 
cflow states into the same places. 

5 Related Work 

In reflective languages, some crosscutting concerns can be controlled through 
meta-programming [8, 16]. Several studies successfully compiled reflective 
programs by using partial evaluation [2, 11, 12]. It is more difficult to ensure 
successful compilation in reflective languages because the programmer can write 
arbitrary meta-programs. 

Wand, Kiczales and Dutchyn presented a formal model of the procedural 
version of PAHS]. Our model is based on theirs, and we used it for compiling 
and optimizing cflow pointcuts. 

Douence et al. showed an operational semantics of an AOP system[S|. In 
their system, a 'monitor' pattern matches a stream of events from a program 
execution, and invokes advice code when it matches. A program transformation 
system inserts code into the monitored program so that it triggers the monitor. 
In our scheme, the partial evaluator automatically performs this insertion. 

Andrews proposed process algebras as a formal basis of AOP languages [T], in 
which advice execution is represented as synchronized processes. ‘Compilation’ 
can be understood as removal of the synchronization. However, our experience 
suggests that transformation techniques as powerful as partial evaluation would 
be necessary to properly remove run-time checks. 

6 Conclusion and Future Work 

In this paper, we presented a compilation model for an aspect-oriented 
programming (AOP) language based on operational semantics and partial evaluation 
techniques. The model explains issues in AOP compilers including identifying 
join point shadows, compiling out pointcuts and recursively applying advice. It 
also explains the optimized cflow implementation in the AspectJ compiler. 

The use of partial evaluation allows us to keep the operational semantics simple 
and to relate the semantics to compilation. It also helped us to understand the 
data dependency in our interpreter by means of its binding-time analysis. We 
believe this approach would also be useful for prototyping new AOP features with 
effective compilation in mind. 

Although our language supports only core features of practical AOP languages, 
we believe that this work could bridge formal studies and the practical 
design and implementation of AOP languages. 

Future directions of this study could include the following topics. Optimization 
algorithms could be studied for AOP programs based on our model, for 
example, elimination of more run-time checks with the aid of static analysis [E]. 
Our model could be refined into more formal systems so that we could relate 
semantics and compilation with correctness proofs. Our system could 
also be applied to designing and testing new AOP features. 



Abstract. Many processes can be seen as transformations of tree-like 
data structures. In compiler construction, for example, we continuously 
manipulate trees and perform tree transformations. This paper introduces 
a pattern matching compiler (Tom): a set of primitives which add 
pattern matching facilities to imperative languages such as C, Java, or 
Eiffel. We show that this tool is extremely non-intrusive, lightweight and 
useful to implement tree transformations. It is also flexible enough to 
allow the reuse of existing data structures. 



1 Introduction 

In compiler construction, there is an obvious need for programming 
transformations of structured documents like trees or terms: parse trees and 
abstract syntax trees (ASTs for short). In this paper, our aim is to present a 
tool which is particularly well-suited for programming various transformations 
on trees/terms. In the paper we will often talk about “term” instead of “tree” 
due to the one-to-one correspondence between these two notions. Our tool results 
from our experience in using existing programming languages and programming 
paradigms to implement transformations of terms. 

In declarative (logic/functional) programming languages, we may find some 
built-in support to manipulate structured expressions or terms. For instance, in 
functional programming, a transformation can be conveniently implemented as a 
function declared by pattern matching, where a set of patterns represents 
the different forms of terms we are interested in. A pattern may contain 
variables (or holes) to schematize arbitrary terms. Given a term to transform, 
the execution mechanism consists in finding a pattern that matches the term. 
When a match is found, variables are initialized and the code related to the 
pattern is executed. Thanks to the mechanism of pattern matching, one can 
implement a transformation in a declarative way, thus reducing the risk of 
implementing it incorrectly. 

For efficiency reasons, it may be interesting to implement similar tree-like 
transformations using (low-level) imperative programming languages for which 
efficient compilers exist. Unfortunately, in such languages, there are no built-in 
facilities to manipulate term structures and to perform pattern matching. There 
are two common solutions to this problem. 

One possibility would be to enrich an existing imperative programming 
language with pattern matching facilities [5, 7, 8, 13]. This hard-wired 
approach ties users to a specific programming language. The situation is thus 
little better than that in declarative languages. Furthermore, because terms are 
built-in, user-defined data structures must be converted to the term structure. 
Such marshalling complicates the user's program, and it incurs a significant 
performance penalty. 

A simpler solution would be to develop a special library implementing pattern 
matching functionality. This approach is followed for example in the Asf+Sdf 
group [To] where a C library called ATerms [9] has been developed. In this 
library, pattern matching is implemented via a function called ATmatch, which 
consists in matching a term against a single pattern represented by a string or a 
term. Therefore, it is possible to define a transformation by pattern matching, 
thanks to a sequence of if-then-else instructions, where each condition is a 
call to the ATmatch function. But this approach has three drawbacks. First, 
matching is performed sequentially: patterns are tried one by one. This can be 
rather inefficient for a large number of patterns. Second, terms and patterns are 
untyped, and thus may be error prone. Third, the programming language and 
the data structure are imposed by the library, and so the programmer cannot 
use his favorite language as well as his own data structure to represent terms. 

To solve the deficiencies of the above two solutions, and in particular for the 
sake of efficiency, we are interested in the compilation of pattern matching. By 
compilation we mean an approach where all patterns are compiled together pro- 
ducing a matching automaton. This automaton then performs matching against 
all patterns simultaneously. Our research on this topic is guided by the following 
concerns: 

- How to efficiently compile different forms of pattern matching? We are 
concerned with simple syntactic matching but also with matching modulo an 
equational theory. In such a case, for example, a pattern x+3 can match the 
expression 3+7 thanks to commutativity of plus. 

- How to implement compilation of pattern matching in a uniform way for a 
large class of programming languages and for any representation of terms? 
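
The x+3 example can be checked with a small interpretive sketch. It is written in Python, is not how Tom compiles matching, and handles commutativity only for a top-level binary +. The tuple encoding of terms, the ?-prefix for pattern variables and the names match and match_commutative are assumptions made for illustration. 

```python
def match(pattern, term, env=None):
    """Syntactic matching; returns a variable assignment or None.

    Terms are nested tuples (symbol, arg1, ...) or atoms. Pattern variables
    are strings starting with '?'. Repeated variables must agree.
    """
    env = dict(env or {})
    if isinstance(pattern, str) and pattern.startswith("?"):
        if pattern in env and env[pattern] != term:
            return None
        env[pattern] = term
        return env
    if isinstance(pattern, tuple) and isinstance(term, tuple):
        if len(pattern) != len(term) or pattern[0] != term[0]:
            return None
        for sub_pattern, sub_term in zip(pattern[1:], term[1:]):
            env = match(sub_pattern, sub_term, env)
            if env is None:
                return None
        return env
    return env if pattern == term else None

def match_commutative(pattern, term):
    """Matching modulo commutativity of a top-level binary '+'."""
    result = match(pattern, term)
    if result is None and isinstance(term, tuple) and len(term) == 3 and term[0] == "+":
        result = match(pattern, ("+", term[2], term[1]))  # try swapped operands
    return result

x_plus_3 = ("+", "?x", 3)
three_plus_7 = ("+", 3, 7)
print(match(x_plus_3, three_plus_7))              # None
print(match_commutative(x_plus_3, three_plus_7))  # {'?x': 7}
```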

To tackle the above mentioned problems, we develop a non-intrusive pattern 
matching compiler called Tom. Its design follows our experiences on the efficient 
compilation of rule-based systems mi\- Our tool can be viewed as a Yacc- 
like compiler translating patterns into executable pattern matching automata. 
Similarly to Yacc, when a match is found, the corresponding “semantic action” 
(a sequence of instructions written in an imperative language) is triggered and 
executed. In a way, we can say that Tom translates a declarative-imperative 
function - defined by pattern matching and imperative instructions - into a fully 
imperative function. The resulting function can be integrated into an application 
written in a classical language such as C, Java, or Eiffel, called the target 
language in the rest of the document. In this paper, we illustrate the different 
advantages of the approach implemented by Tom, namely: 

Efficiency. The gain of efficiency follows from the compilation of matching as 
implemented in Tom. 

Flexibility. When trying to integrate a black-box tool in an existing system, 
one of the main bottlenecks comes from data conversion and the flexibility 
offered to the user. One of the main originalities of our system is its 
independence of term representation. The programmer can use (or re-use) his 
own data structures for terms/trees and then execute matching upon those 
data structures. We propose to access terms using only a simple Application 
Programming Interface (API) defined by the user. 

Generality. Tom is able to consider multiple target languages (C, Java, and 
Eiffel). Tom is implemented in Tom itself as a series of AST transforma- 
tions. The code generation is performed at the very end, depending on the 
target language we are interested in. Hence, the target language is really a 
parameter of Tom. 

Expressivity. Tom supports non-linear patterns and equational matching like 
modern rule-based programming languages. Currently, we have implemented 
pattern matching with list operators. This form of associative matching with 
neutral element is very useful for practical applications. The main difference 
with standard (syntactic) matching is that a single variable may have mul- 
tiple assignments. 
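
To illustrate why a single variable may receive several assignments under list matching, here is a small sketch in Python. It is not Tom's algorithm, which compiles patterns instead of interpreting them. The pattern encoding (names ending in * are list variables, anything else is a constant) and the function name match_list are assumptions made for this example. 

```python
def match_list(pattern, subject, env=None):
    """Yield every assignment that matches `pattern` against the list `subject`.

    A pattern item ending in '*' is a list variable that matches any sublist,
    including the empty one; any other item must equal the next element.
    """
    env = env or {}
    if not pattern:
        if not subject:
            yield dict(env)
        return
    head, rest = pattern[0], pattern[1:]
    if isinstance(head, str) and head.endswith("*"):
        for i in range(len(subject) + 1):      # try every possible split
            prefix = subject[:i]
            if head in env and env[head] != prefix:
                continue                       # non-linear variable must agree
            yield from match_list(rest, subject[i:], {**env, head: prefix})
    elif subject and subject[0] == head:
        yield from match_list(rest, subject[1:], env)

# Pattern (X*, a, Y*) against the list (a, b, a) has two solutions.
solutions = list(match_list(["X*", "a", "Y*"], ["a", "b", "a"]))
for solution in solutions:
    print(solution)
```

The pattern matches once with X* empty and once with X* bound to the first two elements, whereas syntactic matching yields at most one assignment. 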

The paper is organized as follows: Section 2 motivates the main features of 
Tom on a very simple example. In Section 3 we present the main language 
constructs and their precise meanings. Further applications are described in 
Section 4. Since Tom is non-intrusive, it can be used in the context of existing 
applications to implement in a declarative way some functionalities which can 
be naturally expressed as transformations of terms (Section 4.1). Furthermore, 
we show how Tom is used in designing a compiler, via some transformations of 
ASTs performed by pattern matching: in fact, this example is the current 
implementation of Tom itself (Section 4.2). Section 5 presents some related work 
and Section 6 concludes with final remarks and future work. 

2 What Is Tom? 

In this section, we outline the main characteristics of Tom and we illustrate its 
usage on a very simple example specifying a well-known algebraic data type, 
namely the Naturals. 

Tom does not really define a new language: it is rather a language extension 
which adds new matching primitives to an existing imperative language. From 
an implementation point of view, it is a compiler which accepts different native 
languages: C, Java, and Eiffel. The compilation process consists of translating 
new matching constructs into the underlying native language. Since the native 
language and the target language are identical and only introduced constructs 
are expanded, the presented tool can also be seen as a kind of preprocessor. On 
the other hand, the support of multiple target languages, and the fact that the 
input program has to be completely parsed before the transformation process 
can begin, make us consider Tom as a compiler. 

For expository reasons, we assume that Tom only adds one new construct: 
%match. This construct is similar to the match primitive found in ML and related 
languages: given a term (called subject) and a list of pairs: pattern-action, the 
match primitive selects a pattern that matches the subject and performs the 
associated action. This construct may thus be seen as an extension of the classical 
switch/case construct. The main difference is that the discrimination occurs 
on a term and not on atomic values like characters or integers: the patterns are 
used to discriminate and retrieve information from an algebraic data structure. 

To give a better understanding of Tom's features, let us consider a simple 
symbolic computation (addition) defined on Peano integers represented by zero 
and successor. When using Java as the native language, the sum of two integers 
can be described in the following way: 

Term plus(Term t1, Term t2) { 
  %match(Nat t1, Nat t2) { 
    x,zero   -> { return x; } 
    x,suc(y) -> { return suc(plus(x,y)); } 
  } 
} 

This example should be read as follows: given two terms t1 and t2 (that represent 
Peano integers), the evaluation of plus returns the sum of t1 and t2. This is 
implemented by pattern matching: t1 is matched by x, t2 is possibly matched 
by the two patterns zero and suc(y). When zero matches t2, the result of the 
addition is x (with x = t1, instantiated by matching). When suc(y) matches 
t2, this means that t2 is rooted by a suc symbol: the subterm y is added to x 
and the successor of this number is returned. The definition of plus is given in 
a functional programming style, but the plus function can be used in Java to 
perform computations. This first example illustrates how the %match construct 
can be used in conjunction with the considered native language. 

In order to understand the choices we have made when designing Tom, it is 
important to consider Tom as a restricted compiler: it is not necessary to parse 
the native language in detail in order to be able to replace the %match constructs 
by a sequence of native language instructions (Java in this example). This could 
be considered as a kind of island parsing, where only the Tom constructs are 
parsed in detail. The first phase of the transformation process consists of reading 
the program: during this phase, the text is read and Tom constructs are 
recognized, whereas the remaining parts are considered as target language 
constructs. The output of this first phase is a tree which contains two kinds of 
nodes: target language nodes and Tom construct nodes. When applied to the 
previous example, we get the following program with a unique Tom node, 
represented by a box as follows: 

Term plus(Term t1, Term t2) { 
  +--------------------------+ 
  | %match(Nat t1, Nat t2) { | 
  |   x,zero   ->            |  { return x; } 
  |   x,suc(y) ->            |  { return suc(plus(x,y)); } 
  | }                        | 
  +--------------------------+ 
} 

Tom never uses any semantic information of the target language nodes during 
the compilation process: it neither inspects nor modifies the source language 
part. It only replaces the Tom constructs by instructions of the native language. 
In particular, the previous %match construct will be replaced by two nested 
if-then-else constructs. 
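
To give an idea of the shape of such code, here is a hand-written sketch of what two nested if-then-else constructs for plus could look like. It is in Python rather than Java, it is not actual Tom output, and the term representation (nested tuples) together with the accessors get_fun_sym and get_subterm are assumptions made for illustration. 

```python
# Hypothetical term representation: ("zero",) and ("suc", t).
ZERO = ("zero",)

def suc(t):
    """Build the successor of the Peano integer t."""
    return ("suc", t)

def get_fun_sym(t):
    """Root symbol of a term."""
    return t[0]

def get_subterm(t, i):
    """i-th subterm of a term (0-based)."""
    return t[i + 1]

def plus(t1, t2):
    """Hand-expanded form of the two rules x,zero and x,suc(y)."""
    x = t1                              # the variable pattern x always matches t1
    if get_fun_sym(t2) == "zero":       # rule: x, zero -> return x
        return x
    else:
        if get_fun_sym(t2) == "suc":    # rule: x, suc(y) -> return suc(plus(x, y))
            y = get_subterm(t2, 0)
            return suc(plus(x, y))
    raise ValueError("no pattern matches")

two = suc(suc(ZERO))
three = suc(two)
print(plus(two, three))
```

The matching code touches terms only through get_fun_sym and get_subterm, so a different term representation would require changing only those accessors. 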

At this point, it is interesting to note that plus is a function which takes two 
Term data structures as arguments, whereas the matching construct is defined on 
the algebraic data type Nat. This remark introduces the second generic aspect of 
Tom: the matching can be performed on any data structure. For this purpose, the 
user has to define its term representation and the mapping between the concrete 
representation and the algebraic data type used in matching constructs. 

To make our example complete, we have to define the term representation 
Term and the algebraic data type which defines the sort Nat and three 
operators: {zero : → Nat, suc : Nat → Nat, plus : Nat × Nat → Nat} 

For simplicity, we consider in this example that the ATerm library jO] is 
used for term representation. This library is a concrete implementation of the 
Annotated Terms data type (ATerms). In particular it defines an interface to 
create and manipulate term data structures. Furthermore, it offers the possibility 
to represent function symbols, to get the arity of such a symbol, to get the 
root symbol of a term, to get a given subterm, etc. The main characteristic of
this library is to provide a garbage collector and to ensure maximal sharing of
terms. Using this library, it becomes easy to give a concrete implementation of
the function symbols zero and suc (the second argument of makeAFun defines the
arity of the operator): AFun f_zero = makeAFun("zero", 0) and AFun f_suc =
makeAFun("suc", 1). The representation of the constant zero, for example, is
given by makeAppl(f_zero). Similarly, given a Peano integer t, its successor can
be built by makeAppl(f_suc,t). So far we have shown how to represent data
using the ATerm library and how to define matching with Tom, but we have
yet to reveal how these two notions are related. Given a Peano integer t of sort
Term, we have to define how to get its root symbol (using getAFun for example)
and how to know if this symbol corresponds to the algebraic function symbol
suc, intuitively getAFun(t).isEqual(f_suc).

This mapping from the algebraic data type to the concrete implementation 
is done via the introduction of new primitives, %op and %typeterm, which are
described in the next section. 
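
The mapping described in this section can be sketched in plain Java. The following is only an illustration under stated assumptions: the AFun and Term classes are hypothetical stand-ins written for this sketch, not the ATerm library's API, and only the method names getAFun, getArgument and isEqual are borrowed from the text. The plus method is written by hand in the shape the text describes for the compiled %match construct, that is, nested if-then-else tests on root symbols.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for a term library (NOT the real ATerm API).
final class AFun {
    final String name;
    final int arity;
    AFun(String name, int arity) { this.name = name; this.arity = arity; }
    boolean isEqual(AFun other) { return name.equals(other.name) && arity == other.arity; }
}

final class Term {
    private final AFun fun;
    private final List<Term> args;
    Term(AFun fun, Term... args) {
        if (args.length != fun.arity) throw new IllegalArgumentException("arity mismatch");
        this.fun = fun;
        this.args = Arrays.asList(args);
    }
    AFun getAFun() { return fun; }
    Term getArgument(int n) { return args.get(n); }
}

final class PeanoSketch {
    static final AFun F_ZERO = new AFun("zero", 0);
    static final AFun F_SUC = new AFun("suc", 1);

    static Term zero() { return new Term(F_ZERO); }
    static Term suc(Term t) { return new Term(F_SUC, t); }

    // Hand-written equivalent of:
    //   %match(Nat t1, Nat t2) {
    //     x,zero   -> { return x; }
    //     x,suc(y) -> { return suc(plus(x,y)); }
    //   }
    static Term plus(Term t1, Term t2) {
        Term x = t1;                                // a variable pattern always matches
        if (t2.getAFun().isEqual(F_ZERO)) {         // cmp_fun_sym(get_fun_sym(t2), f_zero)
            return x;
        } else if (t2.getAFun().isEqual(F_SUC)) {   // cmp_fun_sym(get_fun_sym(t2), f_suc)
            Term y = t2.getArgument(0);             // get_subterm(t2, 0)
            return suc(plus(x, y));
        }
        // Assumption of this sketch: Java needs every path to return or throw.
        throw new IllegalArgumentException("not a Peano integer");
    }

    // Helpers for experimenting: convert between int and Peano terms.
    static Term fromInt(int n) { return n == 0 ? zero() : suc(fromInt(n - 1)); }
    static int toInt(Term t) {
        return t.getAFun().isEqual(F_ZERO) ? 0 : 1 + toInt(t.getArgument(0));
    }
}
```

For instance, PeanoSketch.toInt(PeanoSketch.plus(PeanoSketch.fromInt(2), PeanoSketch.fromInt(3))) evaluates the sum of two and three on the term representation.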

3 The Tom Language: Main Constructs 

In the previous section we introduced the match construct of Tom via an
example. In this section, we give an in-depth presentation of Tom by explaining
all existing constructs and their behavior. As mentioned previously, Tom
introduces a new construct (%match) which can be used by the programmer to
decompose by pattern matching a tree-like data structure (an ATerm for
example). Tom also introduces a second family of constructs which is used to define
the mapping between the algebraic abstract data type and the concrete
implementation. We distinguish two main constructs: %typeterm and %op are used
to define, respectively, the algebraic sorts and the many-sorted signature of the
algebraic constructors.



3.1 Sort Definition 



In Tom, terms, variables, and patterns are many-sorted. Their algebraic sorts
have to be introduced by the %typeterm primitive. In addition to this primitive,
the mapping from algebraic sorts to concrete sorts (the target language type,
such as Term) has to be defined. Several sub-functions are used for this purpose.
To support the intuition, let us consider again the Naturals example where the
Nat algebraic sort is implemented by ATerms. One possible mapping is the
following:



%typeterm Nat {
  implement           { Term }
  get_fun_sym(t)      { t.getAFun() }
  cmp_fun_sym(s1,s2)  { s1.isEqual(s2) }
  get_subterm(t,n)    { t.getArgument(n) }
  equals(t1,t2)       { t1.isEqual(t2) }
}



— The implement construct describes how the algebraic type is implemented.
The target language part (written between braces: ‘{’ and ‘}’) is never
parsed; it is only used by the compiler to declare some functions and
variables. This is analogous to the treatment of semantic actions in Yacc.
Since in this example we focus our attention on the ATerm library, we
implement the algebraic data type using the “implement { Term }” construct.
But, if we suppose that another data structure is used, “struct myTerm*”
for example, the “implement { struct myTerm* }” construct should be
used to define the mapping.

— get_fun_sym(t) denotes a function (parameterized by a term variable t) that 
should return the root symbol of the term referenced by t. 

As in the C preprocessor, the body of this definition is not parsed, but the 
formal parameter (t) can be used in the body (t.getAFun() in our example).

— cmp_fun_sym(s1,s2) denotes a predicate (parameterized by two symbol
variables s1 and s2). This predicate should return true if the symbols s1 and
s2 are “equal”. The true value should correspond to the built-in true value
of the considered target language (true in Java, and something different
from 0 in C, for example).

— get_subterm(t,n) denotes a function (parameterized by a term variable t
and an integer n). This function should return the n-th subterm of t. This
function is only called with a value of n between 0 and the arity of the root
symbol of t minus 1.

— equals(t1,t2) denotes a predicate (parameterized by two term variables
t1 and t2). Similarly to cmp_fun_sym(s1,s2), this predicate should return
true if terms t1 and t2 are “equal”. This last optional definition is only used
to compile non-linear patterns. It is not required when the specification does
not contain such patterns.
When using the ATerm library, it is defined by “{ t1.isEqual(t2) }”.

To clarify the presentation we only used the ATerms data structure, but
note that any other data structure could be used as well. Tom is a
multi-target language compiler that supports C, Java, and Eiffel.



3.2 Constructor Definition 

In Tom, the definition of a new operator is done via the %op construct. The
many-sorted signature of the operator is given in a prefix notation. Let us consider the
suc : Nat → Nat operator for instance; its definition is: “%op Nat suc(Nat)”.
We stress once again that because Tom has no knowledge of the concrete term
representation, the user has to describe how to represent the newly introduced
operator:

— fsym defines the concrete representation of the constructor. The expression 
that parameterizes fsym should correspond to the expression returned by the 
function get_fun_sym applied to a term rooted by the considered symbol. 

— make(t1,...,tn) denotes a function parameterized by n variables, where n
is the arity of the considered symbol. This function should define how a term
rooted by the considered symbol can be built. The definition of this function
is optional since it is not used by Tom during the matching phase. However,
when defined, it can be used by the programmer to simplify the construction
of terms (see Section 3.4).

In our setting the definition of suc and zero can be done as follows:

%op Nat zero {
  fsym { f_zero }
  make { makeAppl(f_zero) }
}

%op Nat suc(Nat) {
  fsym    { f_suc }
  make(t) { makeAppl(f_suc,t) }
}



When all needed operators are defined, it becomes possible to use them to
define terms and patterns. Terms are written using standard prefix notation. By
convention, all identifiers not defined as constants are seen as variables. Thus,
the pattern suc(y) corresponds to the term suc(y) where y is a variable, and
the pattern suc(zero) corresponds to the Peano integer one.



3.3 The %match Construct

The %match construct is parameterized by a subject (a list of terms) on which
the discrimination should be performed, and a body. As for the switch/case



68 P.-E. Moreau, C. Ringeissen, and M. Vittek 

construct of C and Java, the body is a list of pairs: pattern-action. The pattern is
a list of terms (with free variables) which are matched against a list of terms that
compose the subject. When the pattern matches the subject, the free variables
are instantiated and the corresponding action is executed. Note that this is a
hybrid language construct, mixing two formalisms: the patterns are written in
a pure algebraic specification style using constructors and variables, whereas
the action parts are directly written in the native language, using the variables
introduced by the patterns. Since Tom has no knowledge of what is done inside
an action, the action part should be written in such a way that the function has
the desired behavior. In our Peano example, the suc(plus(x,y)) expression
corresponds to a recursive call of the plus function while the suc function is
supposed to build a successor. Note that this part has nothing to do with Tom:
it only depends on the considered target language. The semantics of the %match
construct is as follows:

Matching: given a subject, the execution control is transferred to the first
pattern that matches the subject. If no such pattern exists, the evaluation
of the %match construct is finished.

Selected pattern: given a pattern which matches the subject, the associated
action is executed, using the free variables instantiated during the matching
phase. If the execution control is transferred outside the %match construct
(by a goto, break, or return statement for example), the matching process
ends. Otherwise, the execution control is transferred to the next
pattern-action whose pattern matches the subject.

End: when no more pattern matches the subject, the %match construct ends,
and the execution control is transferred to the next target language
instruction.
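
The three rules above can be illustrated by a hand-written Java method that mimics the control flow of a %match construct. This is a sketch under assumptions made for brevity: the subject is a non-negative int standing for a Peano integer, the patterns shown in the comments are translated by hand, and the names classify and trace are hypothetical.

```java
import java.util.List;

final class MatchSemanticsSketch {
    // Emulates the control flow of:
    //   %match(Nat n) {
    //     suc(suc(x)) -> { trace.add("at least two"); }            // no return: fall through
    //     suc(x)      -> { trace.add("positive"); return "pos"; }  // return: matching ends
    //     _           -> { trace.add("default"); }
    //   }
    //   return "end";
    static String classify(int n, List<String> trace) {
        if (n >= 2) {              // pattern suc(suc(x)) matches
            trace.add("at least two");
            // control stays inside the construct, so the next matching pattern is tried
        }
        if (n >= 1) {              // pattern suc(x) matches
            trace.add("positive");
            return "pos";          // control leaves the construct
        }
        {                          // pattern _ always matches
            trace.add("default");
        }
        return "end";              // next target language instruction after the construct
    }
}
```

With n = 3 both the first and the second action run, in that order, because the first action does not leave the construct; with n = 0 only the default action runs and control reaches the instruction that follows.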



3.4 Making Terms 

In addition to sort definition, constructor definition and matching constructs,
Tom provides a mechanism to easily build ground terms over the defined
signature. This mechanism, called back-quote (and written ‘`’), can be used in any
target language block as a kind of escape mechanism. The syntax is simple: the
back-quote is followed by a well-formed term written in prefix notation. The last
closing parenthesis denotes the end of the escape mechanism.

Considering the previously defined addition function on Peano integers, the
right-hand side could have been written `suc(plus(x,y)) and the construction
of the suc node would have been done by Tom, using the make attribute
introduced in Section 3.2.



3.5 Equational Matching 

An important feature of Tom is its support for equational matching, in
particular list matching, also known as associative matching with neutral element.
Since a list can be efficiently and naturally represented by a (head, tail) tuple,
Tom provides an extra construct for defining associative data structures: the
%typelist primitive. When defining such a structure, three extra access
functions have to be introduced: get_head, get_tail and is_empty:

— get_head(l) denotes a function parameterized by a list variable l that should
return the first element of the list l. When using the TermList data type,
the definition is “get_head(l) { l.getHead() }”.

— get_tail(l) denotes a function parameterized by a list variable l that
should return the tail of the list l. Using ATerms, it can be defined by
“get_tail(l) { l.getTail() }”.

— is_empty(l) denotes a predicate parameterized by a list variable l. This
predicate should return true if the list l contains no element. One more time,
the mapping to ATerms is obvious: “is_empty(l) { l.isEmpty() }”.

Similarly to the %op construct, Tom provides the %oplist construct to define
list operators. When using such a construct, the user has to specify how a list
can be built. This is done via the two following functions:

— make_empty() should return an empty list. This object corresponds to the
neutral element of the considered data structure.

— make_insert(e,l) should return a new list l' where the element e is
inserted at the head of the list l (i.e. expressions equals(get_head(l'),e)
and equals(get_tail(l'),l) should be true).

One characteristic of list-matching is the possibility to return several matches.
Thus, the semantics of the %match construct has to be extended as follows:

Selected pattern: given a pattern which matches the subject, for each
computed match, the list of free variables is instantiated and the action part
is executed. If the execution control is transferred outside the %match
construct the matching process ends. Otherwise, another match is computed.
When no more match is available, the execution control is transferred to the
next pattern-action whose pattern matches the subject.

This principle can be used to implement a sorting algorithm using a conditional 
pattern matching definition. In the following, we consider an associative data 
structure List and an associative operator conc:



%typeterm List {
  implement           { TermList }
  get_fun_sym(t)      { f_conc }
  cmp_fun_sym(t1,t2)  { t1 == t2 }
  equals(l1,l2)       { l1 == l2 }
  get_head(l)         { l.getFirst() }
  get_tail(l)         { l.getNext() }
  is_empty(l)         { l.isEmpty() }
}

%oplist List conc( Term* ) {
  fsym                { f_conc }
  make_empty()        { makeList() }
  make_insert(e,l)    { l.insert(e) }
}

Considering that two Term elements can be compared by a function
greaterThan, a sorting algorithm can be implemented as follows:

public TermList sort(TermList l) {
  %match(List l) {
    conc(X1*,x,X2*,y,X3*) -> {
      if (greaterThan(x,y)) { return `sort(conc(X1*,y,X2*,x,X3*)); }
    }
    _ -> { return l; }
  }
}

In this example, one can remark the use of list variables, annotated by a ‘*’:
such a variable should be instantiated by a (possibly empty) list. Given a
partially sorted list, the sort function tries to find two elements x and y such
that x is greater than y. If two such elements exist, they are swapped and the
sort function is recursively applied. Otherwise, all other possible matches are
tried (unsuccessfully). As a consequence, the first pattern-action is not exited
by a return statement. Thus, as mentioned previously, the execution control is
transferred to the next pattern-action whose pattern matches the subject (the
second one in this example), and the sorted list l is returned.
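
The behavior of this sort function can be mimicked in plain Java by enumerating the list matches explicitly. The sketch below is not the code that Tom generates; it assumes integer elements in place of Term, uses the built-in comparison for greaterThan, and treats each pair of positions i < j as one match of conc(X1*,x,X2*,y,X3*).

```java
import java.util.ArrayList;
import java.util.List;

final class ListMatchSortSketch {
    static List<Integer> sort(List<Integer> l) {
        // Each pair of positions i < j is one match of conc(X1*,x,X2*,y,X3*).
        for (int i = 0; i < l.size(); i++) {
            for (int j = i + 1; j < l.size(); j++) {
                int x = l.get(i);
                int y = l.get(j);
                if (x > y) {                          // greaterThan(x,y)
                    List<Integer> swapped = new ArrayList<>(l);
                    swapped.set(i, y);                // builds conc(X1*,y,X2*,x,X3*)
                    swapped.set(j, x);
                    return sort(swapped);             // leaving the construct ends the matching
                }
                // guard failed: compute the next match
            }
        }
        // No match left the construct: second pattern-action, _ -> { return l; }
        return l;
    }
}
```

Every swap exchanges an out-of-order pair and therefore strictly decreases the number of inversions, which is why the recursion terminates; when no inversion is left, all matches have been tried without leaving the construct and the list is returned unchanged.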



4 Applications 

4.1 Implementing Matching Operations Using Tom 

As an example of using list matching, we consider the problem of retrieving 
information in a queue of messages containing two fields: destination and data. 
In our example, we define a function which looks for a particular kind of message: 
a message addressed to b and whose data has a given subject. To illustrate the 
flexibility of Tom, we no longer use the ATerm library, and all data structures 
are internally defined. Thus we use the language C, and we respectively consider 
term and list data structures to represent messages and queues. 



struct term { int symbol;
              int arity;
              struct term **subterm;
};

%typeterm Term {
  implement           { struct term* }
  get_fun_sym(t)      { t->symbol }
  cmp_fun_sym(t1,t2)  { t1 == t2 }
  get_subterm(t,n)    { t->subterm[n] }
}

%op Term a { fsym { A } }
%op Term b { fsym { B } }
%op Term subject(Term) { fsym { SUBJECT } }
%op Term msg(Term,Term) { fsym { MSG } }

struct list { struct term *head;
              struct list *tail; };

%typelist List {
  implement           { struct list* }
  get_fun_sym(t)      { CONC }
  cmp_fun_sym(t1,t2)  { t1 == t2 }
  equals(l1,l2)       { list_equal(l1,l2) }
  get_head(l)         { l->head }
  get_tail(l)         { l->tail }
  is_empty(l)         { (l == NULL) }
}

%oplist List conc( Term* ) {
  fsym                { CONC }
  make_empty()        { NULL }
  make_insert(e,l)    { cons(e,l) }
}



In the following function, we use a list-matching pattern to search for a particular
message in a given queue:

struct list *read_msg_for_b(struct list *queue, struct term *search_data) {
  %match(List queue) {
    conc(X1*,msg(b,subject(x)),X2*) -> {
      if (term_equal(x,search_data)) {
        print_term("read_msg: ",x);
        return `conc(X1*,X2*);
      }
    }
    _ -> { /* msg not found */ return queue; }
  }
}



In this function, when a message addressed to b is found but does not
correspond to search_data, another match is computed (all possible instances of
X1, x and X2 are tried). If no match satisfies this condition, the default case is
executed.
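
The same search can be written out by hand in Java to make the enumeration of matches visible. This is a sketch under assumptions: the Msg class and the readMsgForB method are hypothetical names introduced here, messages carry plain strings instead of terms, and each list position is treated as one candidate match of conc(X1*,msg(b,subject(x)),X2*).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical message representation: a destination and a subject.
final class Msg {
    final String dest;
    final String subject;
    Msg(String dest, String subject) { this.dest = dest; this.subject = subject; }
}

final class QueueSketch {
    static List<Msg> readMsgForB(List<Msg> queue, String searchData) {
        for (int i = 0; i < queue.size(); i++) {       // one candidate match per position
            Msg m = queue.get(i);
            if (!m.dest.equals("b")) {
                continue;                              // pattern msg(b, ...) does not match here
            }
            if (m.subject.equals(searchData)) {        // guard written inside the action
                List<Msg> rest = new ArrayList<>(queue.subList(0, i));   // X1*
                rest.addAll(queue.subList(i + 1, queue.size()));         // X2*
                return rest;                           // corresponds to `conc(X1*,X2*)
            }
            // guard failed: another match is computed
        }
        return queue;                                  // default case: message not found
    }
}
```

The input queue is never modified: a successful search returns a fresh list without the selected message, and an unsuccessful one returns the queue itself.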



4.2 Implementing Compilers and Transformation Tools 

The presented language extension has an implementation: jtom¹. One
characteristic of this implementation is that it is written in Tom itself (Java+Tom to
be more precise).

Compiling a program consists in transforming this program (written in some 
source language) into another equivalent program written in some target lan- 
guage. This transformation can be seen as a textual or syntactic transformation, 
but in general, this transformation should be done at a more abstract level to 
ensure the equivalence of the two programs. A good and well-known approach 
consists in performing the transformation of the AST that represents the pro- 
gram. 

Representing an AST can be done in a “traditional way” by defining a data 
structure or a class (in an object oriented framework) for each kind of node. 
Another interesting approach consists in representing this tree by a term. Such 
an approach has several advantages. First, it is a universal representation for
all manipulated information. Second, compared to a collection of objects
scattered in memory, a term can be more easily printed and exchanged with other
tools at any stage of the transformation. Last, all the information is always
available in the term itself.

Thus, given a program, its compilation can be seen as the transformation of 
a term (the AST of the source language program) into another term (the AST 
of the target language program). Transformation rules are usually expressed by 
pattern matching, which is exactly what Tom is suited for. 

The implementation of the Tom compiler is an application of this principle:
it is composed of several phases that respectively transform a term into another
one. The general layout of the compiler is shown in Figure 1.

As illustrated, four main compilation phases can be distinguished. Each phase
corresponds to an abstract syntax whose signature is defined in Tom, using the
signature definition formalism presented in Sections 3.1 and 3.2.

¹ available at http://elan.loria.fr/tom

[Figure: compiler pipeline. Input: Target Language + Tom. Phases: Parsing,
Type-checking, Compilation, Generation. Intermediate representation: typed AST.
Output: Target Language.]

Fig. 1. General layout of the Tom compiler



Parsing. The Tom parser reads a program enriched by Tom constructs and gen- 
erates an Abstract Syntax Tree. As mentioned previously, the source language is 
a superset of the target language, and we have the following particular equation: 
source language = target language + TOM constructs. 

In order to be as general as possible, the current Tom parser is only slightly
dependent on the supported target languages. In particular, it does not include
a full native language parser: it should only be able to parse comments, strings,
and should recognize the beginning and the end of a block (‘{’ and ‘}’ in C or
Java). Using this knowledge, the parser can detect and parse all Tom constructs.
The resulting AST is a list of nodes of two kinds: (1) A Target Language Node
is a string that contains a piece of code written in the target language. This
node does not contain any Tom construct; (2) A Tom Construct Node is an
AST that represents a Tom construct. The role of the Tom compiler consists
in replacing all Tom Construct Nodes by new Target Language Nodes, without
modifying, and even parsing, the remaining Target Language Nodes. When
considering the Naturals example, after parsing, the pattern suc(y) is represented
by the following AST:

Term(Appl(Name("suc"),[Appl(Name("y"),[])]))

Informally, this means that an operator called suc is applied to a list of subterms.
This list is made of a unique term corresponding to the application of y to the
empty list. At this stage, it is not yet possible to know whether y is a
variable or a constant. We can also remark that there is no type information.

Type-checking. For the sake of simplicity, no type information is needed when
writing a matching construct. In particular, Tom variables do not need to be
declared, and the definition of the signature can appear anywhere. Consequently,
any constant not declared in the signature naturally becomes a variable.
Unfortunately, this makes the compilation process harder. During this phase, the
Tom type-checker determines the type of each Tom construct and modifies the
AST accordingly. The output formalism of this phase is a typed Tom AST as
exemplified below:

Term(Appl(Symbol(Name("suc"),
                 TypesToType([Type(TomType("Nat"),TLType("Term"))],
                             Type(TomType("Nat"),TLType("Term"))),
                 TLCode("f_suc")),
          [Variable(Name("y"),Type(TomType("Nat"),TLType("Term")))]))

We can notice that the AST syntax (Term, Appl, Name, etc.) has been extended by
several new constructors, such as Symbol, TypesToType, Variable, etc. A term 
corresponds now to the application of a symbol (and no longer a name) to a list 
of subterms. A symbol is defined by its name, its profile and its implementation 
in the target language (f_suc in this example). We can also notice that the 
profile contains two kinds of type information: the algebraic specification type 
(Nat) and the implementation of this type in the target language (Term). 
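
The core decision of this phase, namely that a constant not declared in the signature becomes a variable, can be sketched in a few lines of Java. This is a simplification written for illustration only: the Appl class, the SIGNATURE set and the resolve method are hypothetical, the result is rendered as a string, and the sorts, profiles and target language implementations that the real type-checker attaches are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical untyped AST node: a name applied to a list of subterms.
final class Appl {
    final String name;
    final List<Appl> args;
    Appl(String name, Appl... args) { this.name = name; this.args = Arrays.asList(args); }
}

final class TypeCheckSketch {
    // Operators declared with %op in the running example.
    static final Set<String> SIGNATURE = new HashSet<>(Arrays.asList("zero", "suc", "plus"));

    // A name without arguments that is not declared in the signature is a variable.
    static String resolve(Appl t) {
        if (t.args.isEmpty() && !SIGNATURE.contains(t.name)) {
            return "Variable(" + t.name + ")";
        }
        List<String> parts = new ArrayList<>();
        for (Appl a : t.args) {
            parts.add(resolve(a));
        }
        return "Appl(Symbol(" + t.name + "),[" + String.join(",", parts) + "])";
    }
}
```

Applied to the untyped tree for suc(y), the sketch keeps suc as a symbol and turns y into a variable, while zero remains a constant because it is declared.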

Compilation. This phase is the kernel of the Tom compiler: it replaces all
Tom constructs by a sequence of imperative programming language
instructions. To remain independent of the target language, this transformation is
performed at the abstract level: instead of generating concrete target
language instructions, the compilation phase generates abstract instructions such
as DeclareVariable, AssignVariable, IfThenElse. The output formalism also
contains some abstract instructions to access the term data structure, such as
GetFunctionSymbol, GetSubterm, etc. After compiling the previous term, we
get the following AST (for better readability, some parts have been removed
and replaced by ‘...’):

CompiledMatch([
  Declaration(Variable(Position([match1,1]),
                       Type(TomType("Nat"),TLType("Term")))),
  Assign(Variable(Position([match1,1]),...),
         Variable(Name("t2"),Type(TomType("Nat"),TLType("Term"))))
  IfThenElse(
    EqualFunctionSymbol(Variable(Position([match1,1]),...),
                        Appl(Symbol(Name("suc"),...))),
    Assign(Variable(Position([match1,1,1]),...),
           GetSubterm(Variable(Position([match1,1]),...),0))
    Action([TL("return suc(plus(x,y));")]),
    // else part
  ) ...
)



The main advantage of this approach is that the algorithm for compiling
pattern matching depends on neither the target language nor the term
data structure. During this phase, a match construct is analyzed, and depending
on its structure, abstract instructions are generated (a Declaration and an
Assignment when a variable is encountered, for example, or an IfThenElse when
a constructor is found).

TermList genTermMatchAuto(Term term, TermList path, TermList actionList) {
  %match(Term term) {
    Variable(_, termType) -> {
      assign = `Assign(term, Variable(Position(path), termType));
      action = `Action(actionList);
      return assign.append(action);
    }
    UnamedVariable(termType) -> {
      action = `Action(actionList);
      return action;
    }
    Appl(Symbol(...), termArgs) -> {
      // generate Declarations, Assignments and an IfThenElse
      failureList = makeList();
      succesList = declList.append(assignList.append(automataList));
      cond = `EqualFunctionSymbol(subjectVariableAST, term);
      return result.append(`IfThenElse(cond, succesList, failureList));
    }
  }
}

Generation. This phase corresponds to the back-end generator: it produces a 
program written in the target language. The mapping between the abstract im- 
perative language and the concrete target language is implemented by pattern 
matching. To each abstract instruction corresponds a pattern, and an associated 
action which generates the correct sequence of target language instructions. In 
Java and C for example, the pattern-action associated to the IfThenElse ab- 
stract instruction is: 

IfThenElse(cond, succes, failure) -> {
  prettyPrint("if (" + generate(cond) + ") {");
  generate(succes);
  prettyPrint("} else {");
  generate(failure);
  prettyPrint("}");
}
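
The idea of the generation phase can be sketched in plain Java without Tom, replacing pattern matching by instanceof tests. The classes below are hypothetical and much smaller than the abstract instruction set of the compiler: there are only two instructions, conditions are kept as strings, and the generator returns the produced text instead of printing it.

```java
final class GeneratorSketch {
    // Hypothetical abstract instructions: a raw target language action or an IfThenElse.
    interface Instr {}

    static final class Action implements Instr {
        final String code;
        Action(String code) { this.code = code; }
    }

    static final class IfThenElse implements Instr {
        final String cond;
        final Instr success;
        final Instr failure;
        IfThenElse(String cond, Instr success, Instr failure) {
            this.cond = cond;
            this.success = success;
            this.failure = failure;
        }
    }

    // One case per abstract instruction, each producing Java or C concrete syntax.
    static String generate(Instr i) {
        if (i instanceof Action) {
            return ((Action) i).code;          // target language code is copied verbatim
        }
        IfThenElse ite = (IfThenElse) i;
        return "if (" + ite.cond + ") { " + generate(ite.success)
             + " } else { " + generate(ite.failure) + " }";
    }
}
```

Supporting another target language would amount to writing a second generate method with a different concrete syntax for each case, which is the separation between the abstract imperative language and the back-end that the text describes.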

Due to lack of space, we cannot give much more detail about the compilation of
Tom. But our experience clearly shows that the main benefits of Tom are
the expressiveness and the efficiency introduced by the powerful
matching constructs. In practice, the use of pattern matching and list-matching
helps the programmer to clearly express the algorithms and, as illustrated in the
following table, it reduces the size of the programs by a factor of 2 or 3 on average.
We present statistics for three typical Tom applications, corresponding to the
three main components of the system: the type-checker, the compiler, and the
generator. For each component, we report the self-compilation time in the last
column (measured on a Pentium III, 1200 MHz). The first two columns give some
size information. For instance, the type-checker consists of 555 lines including
40 pattern matching constructs. After being compiled, the generated Java code
consists of 1484 lines. As illustrated by the compilation speed, the efficiency of
the generated code is sufficient in practice for this kind of application.

Specification    Tom                Generated Java code   Tom to Java
                 (patterns/lines)   (lines)               compilation time (s)
Tom checker      40/555             1484                  0.331
Tom compiler     81/1490            2833                  0.600
Tom generator    87/1124            3804                  0.812



5 Related Work 

Several systems have been developed in order to integrate pattern matching and
transformation facilities into imperative languages. For instance, R++ [T] and
App [7] are preprocessors for C++: the first one adds production rule constructs
to C++, whereas the second one extends C++ with a match construct. Prop [4]
is a multi-paradigm extension of C++, including pattern matching constructs.
Pizza [8] is a Java extension that supports parametric polymorphism, first-class
functions, class cases and pattern matching. Finally, JaCo [13] is an extensible
Java compiler written in an extension of itself: Java + extensible algebraic types.

All these approaches propose some very powerful constructs, but from our
point of view, they are too powerful and less generic than Tom. In spirit, Prop,
Pizza and JaCo are very close to Tom: they add pattern matching facilities to
a classical imperative language, but the method of achieving this is completely 
different. Indeed, Prop, Pizza and JaCo are more intrusive than Tom: they really 
extend C++ and Java with several new pattern matching constructions. On the 
one hand, the integration is better and more transparent. But on the other hand, 
the term data structure cannot be user-defined: the pattern matching process can 
only act on internal data structures. This may be a drawback when one wants 
to extend an existing project, since it is hard to convince a user to program in 
a declarative way if the first thing to do is to translate the existing main data 
structures. 

6 Conclusion and Further Work 

In this paper we have presented a non-intrusive tool for extending existing
programming languages with pattern matching. In our opinion, Tom is a key
component for the implementation of rule-based language compilers, as well as for the
design of program transformation tools, provided that programs are represented
by terms (using for instance ATerms or XML representations). In this context,
a prototype of ELAN compiler using Tom as back-end has already been
successfully implemented for a subset of the language, and the Asf+Sdf group² and
the ELAN group³ are currently designing a common extensible compiler based
on Tom.

For the sake of expressiveness, it is important to continue the integration 
of equational matching into TOM. For now, we have successfully considered the 

² http://www.cwi.nl/projects/MetaEnv
³ http://elan.loria.fr






case of list-matching, which was already supported by Asf+Sdf. In the future,
we still have to go beyond this first case-study by considering other more
complicated and useful equational theories like Associativity-Commutativity and its
extensions.

Abstract. We present a translation from the call-by-value λ-calculus
to monadic normal forms that includes short-cut boolean evaluation.
The translation is higher-order, operates in one pass, duplicates no code,
generates no chains of thunks, and is properly tail recursive. It makes
crucial use of symbolic computation at translation time.



1 Introduction 

Program transformation and code generators offer typical situations where sym- 
bolic computation makes it possible to merge several passes into one. The CPS 
transformation is a canonical example: it transforms a term in direct style into 
one in continuation-passing style (CPS) [39il43] . It appears in several Scheme 
compilers, including the first one |^|33]|42, where it is used in two passes: 
one for the transformation proper and one for the simplifications entailed by 
the transformation (the so-called “administrative redexes”). One-pass versions 
have been developed that perform administrative reductions at transformation 
time [am. They form one of the first, if not the first, instances of higher-order 
and natively executable two-level specifications.

The notion of binding times was discovered early by Jones and Muchnick m
in the context of programming languages. Later it proved instrumental for partial
evaluation [2HI, for program analysis [37], and for code generation [30]. It was
then soon noticed that two-level specifications (i.e., ‘staged’ m, or ‘binding-time
separated’ ESI, or again ‘binding-time analyzed’ [2S] specifications) were
directly expressible in languages such as Lisp and Scheme that offer quasiquote
and unquote, a metalinguistic capability that has since been rediscovered in
‘C PE], cast in a typed setting in MetaML ES], and connected both to modal
logic [18] and to temporal logic m- In Lisp, quasiquote and unquote are used
code generation [32]. In partial evaluation [10|[26|, two- level specifications are 
called ‘generating extensions’. Nesting quasiquote and unquote yields macros 
that generate macros and multi-level generating extensions. 
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The goal of this article is to present a one-pass transformer into monadic 
normal forms (23l|36] that performs short-cut boolean evaluation, duplicates no 
code, generates no chains of thunks, and is properly tail recursive. We consider 
the following source language: 

3 e ::= £ \ x \ Xx.e | ee | if 6 then e else e 
3 b ::= e \ b Ab \ bV b \ \ if 6 then 6 else 6 

We translate programs in this source language into programs in the following 
target language: 

    c ::= return v |
          let x = v v in c | v v |
          if v then c else c |
          let x = λ().c in c | x ()
    v ::= ℓ | x | λx.c

The source language is that of the call-by-value λ-calculus with literals,
conditional expressions, and computational effects. The target language is that of
monadic normal forms (sometimes called A-normal forms [21]), with a
syntactic separation between computations (c, the serious expressions) and values (v,
the trivial expressions), as traditional since Reynolds and Moggi [36l|4T]. The
return production is the unit and the first let production is the bind of monadic
style [47]. Computations are carried out by applications, which can either be
named with a let expression or occur in tail position. Conditional expressions
exclusively occur in tail position. The last two productions specify the declaration
and activation of thunks, which are used to ensure that no code is duplicated.
For example, a source term such as

    λx.g0 (h0 (if (g1 (h1 x)) ∨ x then g2 (h2 x) else x))

is translated into the following target term (automatically pretty printed in 
Standard ML for clarity), in one pass. 

return (fn x => let val k0 = fn w1 => let val w2 = h0 w1
                                      in g0 w2
                                      end
                    val t5 = fn () => let val w3 = h2 x
                                          val w4 = g2 w3
                                      in k0 w4
                                      end
                    val w6 = h1 x
                    val w7 = g1 w6
                in if w7
                   then t5 ()
                   else if x
                        then t5 ()
                        else k0 x
                end)
In this target term, the source context $g_0\,(h_0\,[\cdot])$ is translated into the function 
k0, where the outside call occurs tail recursively. Because of the disjunction in 
the test, a thunk t5 is created for the then branch. In this thunk, the outside 
call occurs tail recursively. The composition of $g_1$ and $h_1$ is sequentialized and 
its result is tested. If it holds true, t5 is activated; otherwise, the second half 
of the disjunction is tested. If it holds true, t5 is activated (the code for t5 is 
shared). Otherwise, the value of x is passed to the (sequentialized) composition 
of $g_0$ and $h_0$. Free variables (i.e., $g_0$, $h_0$, $g_1$, $h_1$, $g_2$, and $h_2$) have been translated 
to themselves (i.e., g0, h0, g1, h1, g2, and h2, respectively). 

Monadic normal forms offer the main advantages of CPS (i.e., all intermediate 
results are named and their computation is sequentialized),¹ and they have been 
used in compilers for functional languages |7]|6l [^[^l^[Mll^|46] . Therefore, 
a one-pass transformation into monadic normal form with short-cut boolean 
evaluation could well be of practical use (i.e., outside academia). 

The rest of this article is organized as follows. We present a standard, two-pass 
translation from the source language to the target language (Section 2), 
and then its one-pass counterpart (Section 3). We then illustrate it (Section 4), 
assess it (Section 5), and then review related work and conclude (Section 6). 



2 A Standard, Two-Pass Translation 

The first part of the translation is simple enough: it is the standard encoding of 
the call-by-value $\lambda$-calculus into the computational metalanguage, 
straightforwardly extended to handle conditional expressions. 

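\[
\begin{array}{rcl}
\mathcal{E}_v[\![\ell]\!] & = & \mathbf{return}\ \ell \\
\mathcal{E}_v[\![x]\!] & = & \mathbf{return}\ x \\
\mathcal{E}_v[\![\lambda x.e]\!] & = & \mathbf{return}\ \lambda x.\,\mathcal{E}_v[\![e]\!] \\
\mathcal{E}_v[\![e_0\;e_1]\!] & = & \mathbf{let}\ w_0 = \mathcal{E}_v[\![e_0]\!]\ \mathbf{in}\ \mathbf{let}\ w_1 = \mathcal{E}_v[\![e_1]\!]\ \mathbf{in}\ w_0\;w_1 \\
\mathcal{E}_v[\![\mathbf{if}\ b\ \mathbf{then}\ e_1\ \mathbf{else}\ e_0]\!] & = & \mathbf{if}\ \mathcal{B}_v[\![b]\!]\ \mathbf{then}\ \mathcal{E}_v[\![e_1]\!]\ \mathbf{else}\ \mathcal{E}_v[\![e_0]\!] \\[1ex]
\mathcal{B}_v[\![e]\!] & = & \mathcal{E}_v[\![e]\!] \\
\mathcal{B}_v[\![b_1 \wedge b_2]\!] & = & \mathbf{if}\ \mathcal{B}_v[\![b_1]\!]\ \mathbf{then}\ \mathcal{B}_v[\![b_2]\!]\ \mathbf{else}\ \mathit{false} \\
\mathcal{B}_v[\![b_1 \vee b_2]\!] & = & \mathbf{if}\ \mathcal{B}_v[\![b_1]\!]\ \mathbf{then}\ \mathit{true}\ \mathbf{else}\ \mathcal{B}_v[\![b_2]\!] \\
\mathcal{B}_v[\![\neg b]\!] & = & \mathbf{if}\ \mathcal{B}_v[\![b]\!]\ \mathbf{then}\ \mathit{false}\ \mathbf{else}\ \mathit{true} \\
\mathcal{B}_v[\![\mathbf{if}\ b_2\ \mathbf{then}\ b_1\ \mathbf{else}\ b_0]\!] & = & \mathbf{if}\ \mathcal{B}_v[\![b_2]\!]\ \mathbf{then}\ \mathcal{B}_v[\![b_1]\!]\ \mathbf{else}\ \mathcal{B}_v[\![b_0]\!]
\end{array}
\]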

The second pass of the translation consists in performing monadic simplifications 
and in unnesting conditional expressions until the simplified term 
belongs to the target language $\Lambda_{\mathrm{ml}}^{c}$. 



¹ The jury is still out about the other advantages of CPS [40]. 
3 A One-Pass Translation 

In this section, we build on the full one-pass transformation into monadic normal 
form for the call-by-value $\lambda$-calculus: 



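\[
\begin{array}{rcl}
\mathcal{E}[\![\ell]\!] & = & \underline{\mathbf{return}}\ \ell \\
\mathcal{E}[\![x]\!] & = & \underline{\mathbf{return}}\ x \\
\mathcal{E}[\![\lambda x.e]\!] & = & \underline{\mathbf{return}}\ \underline{\lambda}x.\,\mathcal{E}[\![e]\!] \\
\mathcal{E}[\![e_0\;e_1]\!] & = & \mathcal{E}_c[\![e_0]\!]\ \overline{\lambda}v_0.\,\mathcal{E}_c[\![e_1]\!]\ \overline{\lambda}v_1.\,v_0\ \underline{@}\ v_1 \\[1ex]
\mathcal{E}_c[\![\ell]\!]\ \kappa & = & \kappa\ \overline{@}\ \ell \\
\mathcal{E}_c[\![x]\!]\ \kappa & = & \kappa\ \overline{@}\ x \\
\mathcal{E}_c[\![\lambda x.e]\!]\ \kappa & = & \kappa\ \overline{@}\ \underline{\lambda}x.\,\mathcal{E}[\![e]\!] \\
\mathcal{E}_c[\![e_0\;e_1]\!]\ \kappa & = & \mathcal{E}_c[\![e_0]\!]\ \overline{\lambda}v_0.\,\mathcal{E}_c[\![e_1]\!]\ \overline{\lambda}v_1.\,\underline{\mathbf{let}}\ w = v_0\ \underline{@}\ v_1\ \underline{\mathbf{in}}\ \kappa\ \overline{@}\ w
\end{array}
\]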



The function $\mathcal{E}$ is applied to subterms occurring in tail position, and the function 
$\mathcal{E}_c$ to the other subterms; it is indexed with a functional accumulator $\kappa$.² This 
transformation is higher-order (witness the type of $\mathcal{E}_c$) and it is also two-level: 
the underlined terms are hygienic syntax constructors and the overlined terms 
are reduced at transformation time (@ denotes infix application). We show in 
the appendix how to program it in ML. This transformation is similar to a 
higher-order one-pass CPS transformation, which can be transformationally derived 
from a two-pass specification. 

The question now is to generalize this one-pass transformation to the full 
source language of expressions and boolean expressions from Section 1. Our insight 
is to index the translation of each boolean expression with the translation of the 
corresponding consequent and alternative. 
Each of them can be the name of a thunk, which we can use non-linearly, or a 
thunk, which we should only use linearly since we want to avoid code duplication. 
Enumerating, we define four translation functions for boolean expressions: 

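\[
\begin{array}{rcl}
\mathcal{B}_{cc} & : & \Lambda^{b} \rightarrow (1 \rightarrow \Lambda_{\mathrm{ml}}^{c}) \times (1 \rightarrow \Lambda_{\mathrm{ml}}^{c}) \rightarrow \Lambda_{\mathrm{ml}}^{c} \\
\mathcal{B}_{vv} & : & \Lambda^{b} \rightarrow \Lambda_{\mathrm{ml}}^{v} \times \Lambda_{\mathrm{ml}}^{v} \rightarrow \Lambda_{\mathrm{ml}}^{c} \\
\mathcal{B}_{cv} & : & \Lambda^{b} \rightarrow (1 \rightarrow \Lambda_{\mathrm{ml}}^{c}) \times \Lambda_{\mathrm{ml}}^{v} \rightarrow \Lambda_{\mathrm{ml}}^{c} \\
\mathcal{B}_{vc} & : & \Lambda^{b} \rightarrow \Lambda_{\mathrm{ml}}^{v} \times (1 \rightarrow \Lambda_{\mathrm{ml}}^{c}) \rightarrow \Lambda_{\mathrm{ml}}^{c}
\end{array}
\]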

The problem then reduces to following the structure of the boolean expressions 
and introducing residual let expressions to name computations if their result 
needs to be used more than once. 

² We refrain from referring to $\kappa$ as a continuation since it is not applied tail recursively. 
For example, let us consider $\mathcal{B}_{cc}[\![b_1 \wedge b_2]\!]\,(\kappa_1, \kappa_0)$, i.e., the translation of a 
conjunction in the presence of two thunks $\kappa_1$ and $\kappa_0$. The activation of $\kappa_1$ and 
$\kappa_0$ will yield the translation of the consequent and of the alternative of this 
conjunction. Naively, we could want to define the translation as follows: 

\[
\mathcal{B}_{cc}[\![b_1]\!]\,(\overline{\lambda}().\,\mathcal{B}_{cc}[\![b_2]\!]\,(\kappa_1, \kappa_0),\ \kappa_0)
\]

Doing so, however, would duplicate $\kappa_0$, i.e., the translation of the alternative 
of the conjunction. Therefore we name its result with a let. The rest of the 
translation follows the same spirit. 
As for the connection between translating a boolean expression and translating 
an expression, we make it using a functional accumulator that will generate 
a conditional expression when it is applied. 




Finally we connect translating an expression and translating a boolean ex- 
pression as follows. 



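\[
\begin{array}{rcl}
\mathcal{E}[\![\mathbf{if}\ b\ \mathbf{then}\ e_1\ \mathbf{else}\ e_0]\!] & = & \mathcal{B}_{cc}[\![b]\!]\,(\overline{\lambda}().\,\mathcal{E}[\![e_1]\!],\ \overline{\lambda}().\,\mathcal{E}[\![e_0]\!]) \\
\mathcal{E}_c[\![\mathbf{if}\ b\ \mathbf{then}\ e_1\ \mathbf{else}\ e_0]\!]\ \kappa & = & \underline{\mathbf{let}}\ k = \underline{\lambda}w.\,\kappa\ \overline{@}\ w \\
 & & \underline{\mathbf{in}}\ \mathcal{B}_{cc}[\![b]\!]\,(\overline{\lambda}().\,\mathcal{E}_v[\![e_1]\!]\ k,\ \overline{\lambda}().\,\mathcal{E}_v[\![e_0]\!]\ k) \\[1ex]
\mathcal{E}_v[\![\ell]\!]\ k & = & k\ \underline{@}\ \ell \\
\mathcal{E}_v[\![x]\!]\ k & = & k\ \underline{@}\ x \\
\mathcal{E}_v[\![\lambda x.e]\!]\ k & = & k\ \underline{@}\ \underline{\lambda}x.\,\mathcal{E}[\![e]\!] \\
\mathcal{E}_v[\![e_0\;e_1]\!]\ k & = & \mathcal{E}_c[\![e_0]\!]\ \overline{\lambda}v_0.\,\mathcal{E}_c[\![e_1]\!]\ \overline{\lambda}v_1.\,\underline{\mathbf{let}}\ w = v_0\ \underline{@}\ v_1\ \underline{\mathbf{in}}\ k\ \underline{@}\ w \\
\mathcal{E}_v[\![\mathbf{if}\ b\ \mathbf{then}\ e_1\ \mathbf{else}\ e_0]\!]\ k & = & \mathcal{B}_{cc}[\![b]\!]\,(\overline{\lambda}().\,\mathcal{E}_v[\![e_1]\!]\ k,\ \overline{\lambda}().\,\mathcal{E}_v[\![e_0]\!]\ k)
\end{array}
\]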



In the second equation, a let expression is inserted to name the context (and 
to avoid its duplication). $\mathcal{E}_v$ is there to avoid generating chains of thunks when 
translating nested conditional expressions. 
The result can be directly coded in ML (see appendix): the source and target 
languages are implemented as data types and the translation as a function. A 
side benefit of using ML is that its type inferencer acts as a theorem prover to 
tell us that the translation maps terms from the source language into terms in 
the target language (a bit more reasoning, however, is necessary to show that 
the translation generates no chains of thunks). Finally, since the translation is 
specified compositionally, it does operate in one pass. 



4 Two Examples 

4.1 No Chains of Thunks 

The term $\lambda x.\,g\,(h\,(\mathbf{if}\ a\ \mathbf{then}\ \mathbf{if}\ b_2\ \mathbf{then}\ b_1\ \mathbf{else}\ b_0\ \mathbf{else}\ x))$ is translated into the 
following target term in one pass. 

return (fn x => let val k0 = fn v1 => let val v2 = h v1
                                      in g v2
                                      end
                in if a
                   then if b2
                        then k0 b1
                        else k0 b0
                   else k0 x
                end)

Each conditional branch directly calls k0. 



4.2 Short-Cut Boolean Evaluation 

The term $\lambda x.\,\mathbf{if}\ a_1 \wedge a_2 \wedge a_3 \wedge a_4\ \mathbf{then}\ x\ \mathbf{else}\ g\,(h\,x)$ is translated into the following 
target term in one pass. 

return (fn x => let val f1 = fn () => let val v0 = h x
                                      in g v0
                                      end
                in if a1
                   then if a2
                        then if a3
                             then if a4
                                  then return x
                                  else f1 ()
                             else f1 ()
                        else f1 ()
                   else f1 ()
                end)

All the else branches directly call f1. 
5 Assessment 

A similar development yields, mutatis mutandis, a CPS transformation that is 
higher-order, operates in one pass, duplicates no code, generates no chain of 
thunks, and is properly tail recursive. 

The author has implemented both transformations in his academic Scheme 
compiler. Their net effect is to fuse two compiler passes into one and to avoid, in 
effect, an entire copy of the source program. In particular, an escape analysis of 
the transformations themselves shows that all of their higher-order functions are 
stack-allocatable [4]. The transformations therefore have a minimal footprint in 
that they only allocate heap space to construct their result, making them well 
suited in a JIT situation. 

6 Related Work, Conclusion, and Future Work 

We have presented a two-level program transformation that encodes call-by-value 
$\lambda$-terms into monadic normal form and achieves short-cut boolean evaluation. 
The transformation operates in one pass in that it directly constructs the normal 
form without intermediate representations that need further processing. As usual 
with two-level specifications, erasing all overlines and underlines yields something 
meaningful: here, an interpreter for the call-by-value $\lambda$-calculus in the monadic 
metalanguage. 

The program transformation can be easily adapted to other evaluation orders. 
Short-cut evaluation is a standard topic in compiling. The author is 
not aware of any treatment of it in one-pass CPS transformations or in one-pass 
transformations into monadic normal form. 

Our use of higher-order functions and of an underlying evaluator to fuse a 
transformation and a form of normalization is strongly reminiscent of the notion 
of normalization by evaluation. And indeed the author is convinced 
that the present one-pass transformation could be specified as a formal instance 
of normalization by evaluation, which is left as future work. 

Monadic normal forms and CPS terms are in one-to-one correspondence, 
and Kelsey and Appel have noticed the correspondence between 
continuation-passing style and static single assignment form (SSA). Therefore, the 
one-pass transformation with short-cut boolean evaluation should apply directly 
to the SSA transformation, which is another direction for future work. 

Acknowledgments. Thanks are due to Mads Sig Ager, Jacques Carette, 
Samuel Lindley, and the anonymous reviewers for comments. 

A Two-Level Programming in ML 

We briefly outline how to program the one-pass translation of Section 3. 

First, we assume a type for identifiers as well as a module generating fresh 
identifiers in the target abstract syntax: 
type ide = string

signature GENSYM = sig
  val init : unit -> unit
  val new : string -> ide
end

Given this type, the source and the target abstract syntax (without condi- 
tional expressions) are defined with two data types: 

structure Source = struct
  datatype e = VAR of ide
             | LAM of ide * e
             | APP of e * e
end

structure Target = struct
  datatype e = RETURN of t
             | TAIL_APP of t * t
             | LET_APP of ide * (t * t) * e
       and t = VAR of ide
             | LAM of ide * e
end


Given a structure Gensym : GENSYM, the two translation functions $\mathcal{E}$ and $\mathcal{E}_c$ are 
recursively defined as two ML functions trans0 and trans1. In particular, trans1 
is uncurried and higher order. For readability of the output, the main translation 
function trans initializes the generator of fresh identifiers before calling trans0: 

(* trans0 : Source.e -> Target.e *)
(* trans1 : Source.e * (Target.t -> Target.e) -> Target.e *)
fun trans0 (Source.VAR x)
    = Target.RETURN (Target.VAR x)
  | trans0 (Source.LAM (x, e))
    = Target.RETURN (Target.LAM (x, trans0 e))
  | trans0 (Source.APP (e0, e1))
    = trans1 (e0,
              fn v0 => trans1 (e1,
                               fn v1 => Target.TAIL_APP (v0, v1)))
and trans1 (Source.VAR x, k)
    = k (Target.VAR x)
  | trans1 (Source.LAM (x, e), k)
    = k (Target.LAM (x, trans0 e))
  | trans1 (Source.APP (e0, e1), k)
    = trans1 (e0,
              fn v0 => trans1 (e1,
                               fn v1 => let val v = Gensym.new "v"
                                        in Target.LET_APP
                                             (v, (v0, v1),
                                              k (Target.VAR v))
                                        end))
(* trans : Source.e -> Target.e *)
fun trans e
    = (Gensym.init (); trans0 e)
Abstract. Many important software systems are written in the C programming 
language. Unfortunately, the C language does not provide 
strong safety guarantees, and many common programming mistakes in- 
troduce type errors that are not caught by the compiler. These errors 
only manifest themselves at run time through unexpected program be- 
havior, and it is often hard to isolate and identify their causes. This paper 
presents the Hobbes run-time type checker for compiled C programs. Our 
tool interprets compiled binaries, tracks type information for all memory 
and register locations, and reports warnings when a variety of type errors 
occur. Because the Hobbes type checker does not rely on source code, it 
is effective in many situations where similar tools are not, such as when 
full source code is not available or when C source is linked with program 
fragments written in assembly or other languages. 

1 Introduction 

Many software systems are written in the C programming language because it is 
expressive and provides precise, low-level control over the machine architecture. 
However, this strength is also a weakness. The expressive power of C is obtained 
through unsafe language features, including pointer arithmetic, explicit memory 
management, unchecked type casts, and so on. These features give the program- 
mer a great deal of control but also make it difficult to ensure software reliability 
and to maintain large programs. 

Given the importance of many systems in this category, it is essential to 
identify defects caused by improper use of unsafe language features. In this 
paper, we present Hobbes, a new run-time analysis tool that identifies a large 
class of errors in compiled C programs. In particular, our tool identifies memory 
access errors and type errors. A memory access error occurs when a program 
accesses an invalid memory location. Two examples of such errors are (1) reading 
from or writing to an unallocated location, and (2) reading from an allocated but 
uninitialized location. A type error occurs when an operation is performed on 
operands whose types are incompatible with the operation. Adding a pointer to 
a real number, calling a function with the wrong number or type of arguments, 
and dereferencing an integer as a pointer are all type errors. 

* This work was performed, in part, while all 3 authors were employed at the Compaq 
Systems Research Center (now part of HP Labs) . 
To catch errors, our tool maintains a shadow memory containing the allocation 
status and type of each location accessible to the target program, which 
it updates and checks as the target is running. Purify demonstrated the 
effectiveness of a shadow memory-based approach for identifying memory access 
errors [3]. Purify modifies the target program to maintain allocation and 
initialization status for each memory location, and it instruments each memory 
operation to check that the status information for the address being accessed is 
in an appropriate state. The Hobbes type checker goes beyond Purify by tracking 
not only memory status information, but also the type stored at each location. 
The type information enables our tool to check the types of the operands for 
each operation performed as the program executes. 

The Hobbes prototype checks for errors in Linux binaries on the Intel x86 
architecture. Hobbes consists of two major components: an instrumentable x86 
interpreter and a run-time type checker. To check a program for type errors, the 
type checker maintains the shadow memory and checks each interpreted instruc- 
tion for errors. Memory access and type errors are reported to the programmer, 
along with the call stack and the relevant data values and types. 

The type checker extracts type information from the symbol tables and debug 
tables embedded in the binary program. It uses this information to determine 
the types of storage locations allocated to global variables, local variables, and 
parameters of functions. When debugging information is incomplete or not avail- 
able, the type checker assumes more conservative types for memory locations. 
Even when given only partial type information for the target program, Hobbes 
can still identify a useful set of errors. 

The Hobbes architecture provides the following benefits: 

1. Hobbes uses only the binary representation of programs and does not rely 
on the source code for the target program or included libraries. 

2. Hobbes is applicable to programs written in a mixture of any languages that 
compile into the standard binary format. 

3. Hobbes does not modify the data representations or layout of the program. 

We are not aware of other tools that provide all three of these benefits. 
Loginov et al. present a system similar to ours that employs source-to-source 
translation to insert code to maintain and check shadow memory [I]. Relying 
on source code translation limits their handling of libraries and mixed-language 
programs, and their tool does not preserve the instruction stream of the original 
program. Several other tools have been proposed to check for memory access 
errors and some type errors by extending the representation of pointers to include 
additional information. However, we wished to avoid 
changing the data layout of the program since such changes are not always 
feasible in large systems. 

Our experience indicates that the Hobbes type checker is an effective tool 
for finding type errors in programs. When applied to a set of programs from 
an undergraduate compilers class, it found a number of both memory errors 
and type errors, and it scaled reasonably well when checking larger programs. In 
particular, the false alarm rate was not a significant impediment to using the tool. 
There is a substantial performance penalty for using the Hobbes type checker 
prototype, but we are confident that improvements we describe will significantly 
improve the performance of the system. 

Section 2 motivates this work by demonstrating how run-time type checking 
can catch a number of common errors in C programs. Sections 3 and 4 describe 
the general Hobbes architecture and the type checker, respectively. We summarize 
our experiments to validate the type checker and measure performance in 
Section 5, and Section 6 compares the Hobbes type checker to related work. We 
conclude in Section 7 and outline directions for future work. 

2 Motivating Examples 

In this section, we present some errors that the Hobbes type checker catches, but 
which are not caught by the C compiler’s static type checking or the allocation 
checking performed by tools like Purify. Figure 1 contains programs exhibiting 
these errors. In each case, we outline how the errors are caught. 

In Example 1, the programmer writes a pointer into a union but then reads 
the union value as an integer. On the store to x.p, the type checker sets the 
shadow memory for that location to pointer. The type pointer is inferred 
because a lea (load effective address) instruction is used to compute &i. A multiply 
instruction cannot be applied to an operand of type pointer, so when the 
multiply of x.k occurs, the type checker detects the type mismatch and gener- 
ates a warning message. This example is interesting because it shows that useful 
type checking can be done without any help from the compiler or debugging 
information. 
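
The check described for Example 1 can be sketched in a few lines of C. This is a minimal illustration under simplifying assumptions, not the Hobbes implementation: the tag names, the word-granular `shadow` array, and the two analysis functions are hypothetical, whereas the tool itself analyzes x86 instructions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type tags; the names are illustrative only. */
typedef enum { TAG_UNKNOWN, TAG_INT, TAG_POINTER } tag_t;

#define MEM_WORDS 16
static tag_t shadow[MEM_WORDS];  /* one tag per word; starts as TAG_UNKNOWN */

/* Analysis for a store of a value computed by lea: record `pointer`. */
void on_store_lea_result(size_t loc) {
    assert(loc < MEM_WORDS);
    shadow[loc] = TAG_POINTER;
}

/* Analysis for a multiply: returns 1 if a warning should be issued
   because the operand at `loc` currently holds a pointer, 0 otherwise. */
int on_multiply(size_t loc) {
    assert(loc < MEM_WORDS);
    return shadow[loc] == TAG_POINTER;
}
```

Calling `on_store_lea_result` for the location of the union and then `on_multiply` for the same location reports the mismatch, without consulting any debugging information.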

Example 2 shows an array bounds error that is not normally detected by Purify 
or similar systems. The programmer writes to y.a[10], which is beyond the 
end of the array, but still part of an allocated structure. The assignment 
overwrites the field y.h, which follows the array in memory. If debugging information 
is included in the program binary, the type checker knows that y.h should have 
type int. It reports an error when the program writes a value of type pointer 
instead. If no debugging information is available, the write is permitted, but the 
type checker detects an error when the value of type pointer in y.h is later 
used in a multiplication. 

Example 3 shows a common pitfall in the use of the standard C library sorting 
function, qsort(). The comparison function required by qsort() is called with 
pointers to the elements to be compared, rather than the elements themselves. 
The naive programmer who wrote Example 3 omitted this extra level of 
indirection. A cast is almost always required when using qsort(), and the one used 
here, though not unusual, masks the error. Given the debugging information for 
the program, the type checker expects values of type int for each parameter 
of cmpint(). When values of type pointer are passed instead, it generates a 
warning. 
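
For contrast with Example 3, the following sketch shows a comparison function that respects the extra level of indirection. The names `cmpint_ptr` and `sort_ints` are hypothetical and do not appear in the paper; only `qsort` is the standard C library function.

```c
#include <assert.h>
#include <stdlib.h>

/* qsort passes pointers to the elements, so dereference before comparing. */
int cmpint_ptr(const void *pa, const void *pb) {
    int a = *(const int *)pa;
    int b = *(const int *)pb;
    return (b < a) - (a < b);
}

/* Sorts n ints in ascending order. Because cmpint_ptr already has the
   signature qsort expects, no cast is needed to pass it. */
void sort_ints(int *array, size_t n) {
    assert(array != NULL || n == 0);
    qsort(array, n, sizeof array[0], cmpint_ptr);
}
```

With this signature the compiler checks the comparator's type itself, so the mistake in Example 3 cannot be hidden behind a cast.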
Example 1: 

    union {
      int k;
      int *p;
    } x;

    void ex1() {
      int i, j;
      x.p = &i;
      j = 17 * x.k;
    }

Example 2: 

    struct {
      int *a[10];
      int h;
    } y;

    void ex2() {
      int i, j;
      for (i = 0; i <= 10; i++)
        y.a[i] = &j;
      y.h *= 10;
    }

Example 3: 

    int cmpint(int a, int b) { return ((b < a) - (a < b)); }

    void ex3() {
      int i;
      int array[N];

      qsort(array, N, sizeof(array[0]),
            (int (*)(const void *, const void *)) cmpint);
    }

Fig. 1. C programs with type errors. 



3 The Hobbes System Architecture 

Hobbes consists of two distinct pieces: an x86 interpreter that runs the target 
program and the type checker analysis tool, a module that is called by the 
interpreter when events of interest occur in the target. The operating system kernel, 
in this case Linux, is unmodified. In this section, we describe the interpreter. In 
the next section, we describe the type checker. 

The Hobbes platform is a general framework in which to build analysis tools 
like the Hobbes type checker. The interpreter plays the same role as a binary 
editor like Atom or the instrumentable dynamic compiler that underlies 
Valgrind. An analysis tool first registers interest in events that may occur 
while the target is running. For example, a tool may indicate that it wants 
notification each time the target accesses memory or executes a specific opcode. 
The interpreter then runs the target, which is unaware that instrumentation is 
taking place, and calls analysis routines provided by the tool when interesting 
events occur. Arguments to the analysis routines convey relevant information 
about the event, indicating any memory addresses, values, and registers involved. 
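
The registration protocol described above can be sketched as a table of callbacks. This is a hedged illustration only: the event kinds, `register_interest`, `notify`, and the callback signature are hypothetical names chosen for the example, not the interface of the Hobbes interpreter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical event kinds a tool may register interest in. */
typedef enum { EV_MEM_READ, EV_MEM_WRITE, EV_COUNT } event_kind;

/* An analysis routine receives the address and value involved in the event. */
typedef void (*analysis_fn)(uint32_t addr, uint32_t value, void *tool_state);

typedef struct { analysis_fn fn; void *state; } handler;
static handler handlers[EV_COUNT];

/* Called by a tool at load time to register interest in an event kind. */
void register_interest(event_kind ev, analysis_fn fn, void *state) {
    assert(ev < EV_COUNT);
    handlers[ev].fn = fn;
    handlers[ev].state = state;
}

/* Called by the interpreter when an event occurs in the target. */
void notify(event_kind ev, uint32_t addr, uint32_t value) {
    assert(ev < EV_COUNT);
    if (handlers[ev].fn != NULL)
        handlers[ev].fn(addr, value, handlers[ev].state);
}

/* Example tool: counts memory writes performed by the target. */
void count_writes(uint32_t addr, uint32_t value, void *state) {
    (void)addr;
    (void)value;
    ++*(unsigned *)state;
}
```

A tool would call `register_interest(EV_MEM_WRITE, count_writes, &counter)` once, after which every `notify(EV_MEM_WRITE, ...)` from the interpreter reaches the tool, while events with no registered routine cost only a null check.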
[Figure 2 diagram: the target name space holds the target, typecheck.so, ld-linux.so, and libc.so; the interpreter name space holds the interpreter, ld-linux.so, and libc.so.] 

Fig. 2. The Hobbes system architecture. 



A major goal of Hobbes is to provide a program environment for the target 
that is as close as possible to its normal execution environment. This goal in- 
fluenced our design in two ways. First, we placed the components of Hobbes in 
normally unused parts of the address space to avoid having to relocate the target. 
Second, Hobbes uses two distinct name spaces, as illustrated in Figure 2. The 
target name space contains the target program, the type checker tool (which is a 
shared library) and other libraries required by the target. The interpreter name 
space contains the interpreter and the (potentially different) set of libraries that 
are linked with it. This separation prevents problems arising from name clashes 
or version mismatches between the libraries used in the interpreter and those 
used in the target, as well as potential interference problems caused by the in- 
terpreter and target sharing library data structures. Analysis tools reside in the 
target name space to give them access to the target dynamic loader, which is 
used to resolve target addresses to names. 

The interpreter name space is created by the Linux kernel. To create the 
target name space, the Hobbes interpreter simulates the actions that the kernel 
would have taken to run the target, including running a new copy of the dynamic 
loader (ld-linux.so). This second loader loads the target, its shared libraries, 
and the analysis tools. 

The Hobbes interpreter is written entirely in x86 assembly language. The 
main loop in the interpreter fetches instructions from the target instruction 
stream and performs computed jumps into tables whose entries are code frag- 
ments. The code fragment for each instruction: 

1. decodes the operand specifiers and loads the addresses of the operands into 
specific registers, 

2. performs the operation defined by the instruction opcode on those registers, 
and 

3. moves the result, if the instruction has one, to its ultimate destination. 

Except in rare circumstances, the core of the implementation for each opcode is 
performed by the corresponding x86 instruction. This method allows side-effects, 
such as the setting of condition codes, to be captured faithfully, and it improves 
the chances of correctly emulating the execution of unusual code sequences or 
instructions that behave differently on different x86 implementations. 

The shared libraries for tools register analysis routines with the interpreter 
when the libraries are initialized as part of the loading process. If a tool has 
registered an analysis routine for a particular instruction opcode, a call to the 
analysis routine is inserted into the appropriate table entry between the first 
and second step above. The analysis routines are executed directly, not inter- 
preted. The interpreter does not interpret operating system kernel code. When 
the interpreter encounters a system call, it executes a kernel trap in the normal 
way, after loading the registers with the arguments needed for the call. Some 
calls, notably those dealing with signals, are handled specially so as to maintain 
control of the target program. 
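
The placement of an analysis call between operand decoding and execution can be illustrated with a toy interpreter. The sketch below assumes a made-up three-opcode register machine rather than x86, and is written in C rather than assembly; `insn`, `set_hook`, `run`, and `count_mul` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

enum { OP_HALT, OP_ADD, OP_MUL, OP_COUNT };   /* toy opcodes, not x86 */

typedef struct { int op, dst, src; } insn;
typedef void (*hook_fn)(const insn *i, const int *regs);

static hook_fn hooks[OP_COUNT];   /* one optional analysis routine per opcode */
static int mul_seen;              /* state of the example tool below */

void set_hook(int op, hook_fn h) {
    assert(op >= 0 && op < OP_COUNT);
    hooks[op] = h;
}

/* Example analysis routine: counts executed multiplications. */
void count_mul(const insn *i, const int *regs) {
    (void)i;
    (void)regs;
    ++mul_seen;
}

/* Runs prog until OP_HALT and returns the final value of register 0. */
int run(const insn *prog, int *regs) {
    for (const insn *pc = prog; pc->op != OP_HALT; ++pc) {
        assert(pc->op > OP_HALT && pc->op < OP_COUNT);
        int *dst = &regs[pc->dst];          /* step 1: locate the operands */
        const int *src = &regs[pc->src];
        if (hooks[pc->op] != NULL)          /* analysis call, if registered */
            hooks[pc->op](pc, regs);
        int result = (pc->op == OP_ADD) ? *dst + *src
                                        : *dst * *src;   /* step 2: operate */
        *dst = result;                      /* step 3: store the result */
    }
    return regs[0];
}
```

Because the hook runs after the operands are located but before the operation, the analysis routine sees the operand values as they are about to be consumed.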

Other analysis frameworks, such as Atom and Valgrind, employ binary code 
modification techniques to avoid the overhead of interpreting machine code. Al- 
though we could have adopted these techniques to obtain better performance, 
we chose to implement the interpreter for several reasons. First, interpretation 
preserves the layout and location of the target code and data segments, which 
reduces the likelihood of introducing unintended errors into the target during 
instrumentation. Also, no publicly available binary editor or dynamic compiler 
existed for x86 Linux when we started (Valgrind had not yet been released), 
and writing the interpreter was the simplest and fastest way to build a working 
prototype. In addition, the Hobbes type checker imposes a large overhead on 
execution beyond the interpreter’s overhead, making the argument for more effi- 
cient instrumentation techniques less compelling. The large overhead is partially 
due to the type checker instrumenting virtually every instruction in order to 
track values as they pass through registers. In contrast, Purify only instruments 
memory accesses. 

4 The Hobbes Type Checker 

During startup, the Hobbes type checker shared library initializes its internal 
data structures and shadow memory and registers analysis routines with the 
interpreter. When an instruction of the target program is interpreted, the type 
checker tests the types of the operands and updates the type information in the 
shadow memory according to the instruction semantics. Any inconsistencies are 
reported to the user. In this section, we describe the shadow memory layout and 
data structures used by the type checker, demonstrate the steps to type check 
instructions and function calls, and describe features that reduce occurrences of 
false alarms. A false alarm occurs when the type checker incorrectly reports that 
a type error has occurred. 

Shadow Memory and Type Representation. The x86 architecture provides 
a 4 GB address space, which the type checker divides into three sections. It 
uses addresses 0x00000000 - 0x5fffffff for the target program memory and 
addresses 0x60000000 - 0xbfffffff for the shadow memory. The Linux kernel 
utilizes parts of the remaining 1 GB, and we do not use it in Hobbes. Each 
byte in the target program is matched by a byte in the shadow memory, which 
encodes its type. To map from a data address to its shadow address, the type 
checker simply adds 0x60000000 to the data address. The interpreter places the 
virtual registers in the target address section so that the interpreter can shadow 
them like all other locations accessible to the target. 
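
The address mapping amounts to a single addition. The sketch below assumes that target data addresses lie below 0x60000000, which is what the fixed offset and the 1 GB left to the kernel imply; the function and macro names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SHADOW_OFFSET 0x60000000u  /* offset stated in the text */
#define TARGET_LIMIT  0x60000000u  /* assumed first address past the target section */

/* Maps a target data address to the address of its shadow byte. */
uint32_t shadow_address(uint32_t data_addr) {
    assert(data_addr < TARGET_LIMIT);
    return data_addr + SHADOW_OFFSET;
}
```

Since the mapping is one-to-one and needs no table lookup, every byte of target data, including the virtual registers, has a shadow byte at a fixed distance.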

The type checker tracks primitive C types. It currently represents structures, 
unions, and arrays as sequences of these primitive types and does not distinguish 
between different pointer types. Each primitive type is encoded in the shadow 
memory with a bit pattern equal to the size of the type in length. For example, 
the four-byte integer encoding covers four bytes of shadow memory. Each byte 
in the shadow memory contains four fields: 

    | c | i | v | t |

The continuation bit c is zero if the byte is the first byte of a type encoding, 
and one otherwise. When c is set, the other seven bits are unused. The initialized 
bit i indicates whether the corresponding data object has been initialized. The 
invariant bit v indicates whether the type of the corresponding data object may 
change during execution (i.e., whether the type encoding may be overwritten 
with a different type encoding). The type checker marks global data and stack 
locations for parameters and local variables as invariant when the debugging 
information supports doing so. The base type t encodes the type of the data. The 
type checker currently supports the primitive types int8 (char), uint8 (unsigned 
char), unk8 (one byte of unknown type), int16, uint16, unk16, int32, uint32, 
unk32, float, double, and pointer. The unallocated base type indicates one 
byte of unallocated memory, and the code base type indicates one byte of code. 

We present several type encodings to illustrate the layout of these structures: 



    uninitialized pointer:          | 0 0 0 pointer | 1 | 1 | 1 |
    initialized, unsigned char:     | 0 1 0 uint8 |
    initialized, invariant short:   | 0 1 1 int16 | 1 |


Currently, there is unused space in the encodings for multi-byte types. However, 
these encodings enable a fast mapping function from each data value to its 
shadow memory and are easy to decode. In addition, we plan to use the remaining 
space to encode aggregate type and pointer information in the future. 
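
One possible realization of this encoding is sketched below. The text fixes the four fields but not their bit positions or the numeric codes of the base types, so the layout used here (c in bit 7, i in bit 6, v in bit 5, t in the low five bits) and the enumeration values are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Base types named in the text; the numeric codes are assumed. */
enum base_type { T_UNALLOCATED, T_CODE, T_INT8, T_UINT8, T_UNK8,
                 T_INT16, T_UINT16, T_UNK16, T_INT32, T_UINT32,
                 T_UNK32, T_FLOAT, T_DOUBLE, T_POINTER };

#define BIT_C  0x80u   /* continuation bit (assumed position) */
#define BIT_I  0x40u   /* initialized bit (assumed position) */
#define BIT_V  0x20u   /* invariant bit (assumed position) */
#define MASK_T 0x1fu   /* base type field (assumed width) */

/* Writes the encoding of one value of `size` bytes into shadow[0..size-1]:
   a header byte followed by size-1 continuation bytes. */
void encode_type(uint8_t *shadow, size_t size, enum base_type t,
                 int initialized, int invariant) {
    assert(size >= 1 && (unsigned)t <= MASK_T);
    shadow[0] = (uint8_t)((initialized ? BIT_I : 0u) |
                          (invariant ? BIT_V : 0u) | (unsigned)t);
    for (size_t k = 1; k < size; ++k)
        shadow[k] = BIT_C;
}

/* Skips back over continuation bytes and returns the base type stored
   in the header byte that covers shadow[index]. */
enum base_type type_at(const uint8_t *shadow, size_t index) {
    while (index > 0 && (shadow[index] & BIT_C))
        --index;
    return (enum base_type)(shadow[index] & MASK_T);
}
```

Under these assumptions, an uninitialized pointer occupies one header byte followed by three continuation bytes, and the type of any interior byte is found by walking back at most three positions.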



Type Checker Initialization. The type checker performs the following ini- 
tialization steps before the target program begins execution. 

1. The type checker reads all available debugging information in the target 
and shared library object code. This information includes type declarations, 
function prototypes, global and local variable declarations, and the mapping 
from names to addresses. 
2. For each type, the type checker creates a template block of shadow memory 
that encodes it, as outlined above. Hobbes creates the template block for 
aggregate types by concatenating the template blocks for each element. If 
part of a type is unknown or ambiguous, such as when a union may contain 
two different primitive types, the corresponding part of the template block 
contains the unknown type encoding of the appropriate size. 

3. For each function, the type checker creates two template blocks, one for its 
parameters and one for its local variables. These blocks contain the shadow 
memory encodings for the function’s activation record. Local variable lo- 
cations that may contain different types at different points in the function 
are assigned the unknown type. All encodings in these blocks are invariant, 
except those for unknown types, which are not marked as invariant. 

4. The type checker initializes the shadow memory for global variables with the 
template blocks created in step 2. All global variables are invariant, except 
those which have unknown type. 

5. The type checker registers analysis routines for opcodes and system calls 
with the interpreter. 
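
The template blocks of step 2 can be pictured with a small sketch. The Python below is illustrative only: the per-byte tags, the "cont" marker for trailing bytes, the size table, and all function names are assumptions made for this example and are not Hobbes's actual encoding. Padding inside aggregates is ignored.

```python
# Illustrative sketch only: tags, sizes and names are assumptions,
# not the encoding used by Hobbes.
PRIMITIVE_SIZES = {"int8": 1, "int16": 2, "int32": 4, "pointer": 4}


def primitive_template(type_name):
    """One shadow entry per byte: the first names the type, the rest continue it."""
    size = PRIMITIVE_SIZES[type_name]
    return [type_name] + ["cont"] * (size - 1)


def unknown_template(size):
    """Unknown-type encoding covering `size` bytes (e.g. unk32 for 4 bytes)."""
    return ["unk%d" % (8 * size)] + ["cont"] * (size - 1)


def struct_template(field_types):
    """Aggregate template: the concatenation of the field templates."""
    block = []
    for field_type in field_types:
        block.extend(primitive_template(field_type))
    return block


def union_template(member_types):
    """A union whose members disagree gets the unknown encoding of its size."""
    size = max(PRIMITIVE_SIZES[t] for t in member_types)
    if len(set(member_types)) == 1:
        return primitive_template(member_types[0])
    return unknown_template(size)
```

For instance, under these assumptions a structure holding an int32 followed by an int8 yields a five-entry block, while a union of an int32 and a pointer yields the four-byte unknown encoding.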

The type checker precomputes the type representations and template blocks 
to avoid translating types into their encodings at run time. The types for loca- 
tions originally marked as unknown are refined during execution as the type 
checker observes which operations are performed on the data. As described 
above, unknown types are introduced for locations where values of different
types may be stored. They are also used when the target or libraries contain
incomplete type information, which may occur if they are compiled without
generating debugging information, if they are linked with hand-written assembly
code, and so on.

The type checker also overrides malloc, free, and other memory management
routines with versions that update and test the shadow memory in the obvious
ways. When the interpreted program makes a system call, the interpreter
copies the arguments from the virtual registers into the processor’s registers
and then performs the standard kernel trap. The type checker installs built-in
instrumentation callbacks to check the validity of the argument types prior to
the system call and to set the type of the return value afterwards. The 30 most
common system calls have been instrumented to date.



Instruction Analysis Routines. Each instruction analysis routine type checks 
all occurrences of a specific instruction opcode in the target execution stream. 
The interpreter provides the routines with the locations of the source and desti- 
nation operands. Each routine 

1. checks that the source operands are allocated and initialized, and that the 
destination is allocated; 

2. checks that load and store instructions and indirect addressing modes only 
dereference data of type pointer; 

3. checks the types of the source operands and computes the result type; and

4. updates the shadow memory to reflect the destination’s new type, if
applicable.

We elaborate on the third and fourth steps for a representative instruction. The 
type checker begins step 3 by extracting the type information for the sources 
from the shadow memory. A table is then indexed to determine the result type. 
To illustrate this process, we consider the instruction addl SRC, DST. This
instruction adds SRC to DST, storing the result in DST. Both operands are
four-byte values, as indicated by the suffix l in the instruction name. For
example, in addl %eax, 4(%ebp), the SRC operand is in register %eax and the DST
operand is located at the address stored in register %ebp, plus 4.

The following table computes the result type for the operation, based on the 
types of SRC and DST. For simplicity, we include only a few of the possible 
operand types. 



addl SRC, DST

                        DST
  SRC       int8     int32    pointer  unk32
  int8      error    error    error    error
  int32     error    int32    pointer  unk32
  pointer   error    pointer  error    unk32
  unk32     error    int32    pointer  unk32



The type checker generates a warning whenever a table lookup returns error. 
In this example, the operands may be two integers or an integer and a pointer, 
but not two pointers. If DST is unknown, the result stays unknown. If SRC is 
unknown, the result type will be the type of DST. These heuristics for unknown 
types are not sound, but they reduce the false alarm rate when precise type 
information is not available for the operands. To aid in debugging, the type 
checker reports the stack trace and relevant memory and register values’ types 
for each warning. If debugging information is available, the stack trace includes 
the source’s file name and line number. 
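
The table lookup of step 3 can be sketched as follows. The table contents are transcribed from the addl table above (rows are SRC types, columns are DST types); the data layout, the function name, and the warning text are assumptions for illustration.

```python
# Result types for `addl SRC, DST`, transcribed from the table above.
# The dictionary layout and names are illustrative assumptions.
DST_TYPES = ("int8", "int32", "pointer", "unk32")

ADDL_TABLE = {
    # SRC type: result for each DST type, in DST_TYPES order
    "int8":    ("error", "error",   "error",   "error"),
    "int32":   ("error", "int32",   "pointer", "unk32"),
    "pointer": ("error", "pointer", "error",   "unk32"),
    "unk32":   ("error", "int32",   "pointer", "unk32"),
}


def addl_result(src_type, dst_type, warnings):
    """Look up the result type; record a warning when the table says error."""
    result = ADDL_TABLE[src_type][DST_TYPES.index(dst_type)]
    if result == "error":
        warnings.append(
            "addl: incompatible operand types %s, %s" % (src_type, dst_type)
        )
    return result
```

Adding an int32 to a pointer yields a pointer, an unknown SRC takes the type of DST, and two pointers produce a warning, matching the rules described in the text.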

We show type compatibility tables for the four byte mov instruction and the 
lea instruction below. The lea sets DST to be the address of the SRC. These 
two instructions are insensitive to the original type of DST. 

movl SRC, DST

  SRC      int8     int32    pointer  unk32
  DST      error    int32    pointer  unk32

leal SRC, DST

  SRC      int8     int32    pointer  unk32
  DST      pointer  pointer  pointer  pointer



Before returning control to the interpreter, the type checker writes the result 
type into the shadow memory for DST. If the new result type is different from 
the current type of DST and DST is invariant, the checker generates a warning. 
Otherwise, the initialized form of the result type is written into the shadow
memory. Since this type may have a different size than what was there previously,
the type checker assigns an unknown type to any partially overwritten type 
encodings immediately before and after the shadow memory for DST. 
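
A minimal sketch of this update step is given below. It assumes a shadow memory modeled as a Python list with one entry per byte, where each entry records the type, start address, and length of the value covering that byte. This representation, the per-byte "unk8" tag used for leftover bytes, and the function name are assumptions for illustration, not the Hobbes implementation.

```python
def write_type(shadow, invariant, addr, new_type, size):
    """Record that `size` bytes at `addr` now hold `new_type`.

    shadow:    list with one entry per byte, either None or a tuple
               (type_name, start_address, length) for the value covering it
    invariant: set of start addresses whose type must not change
    Returns a list of warning strings (empty if the write was accepted).
    """
    current = shadow[addr]
    if addr in invariant and current is not None and current[0] != new_type:
        # Invariant location: warn and leave the shadow memory untouched.
        return ["type of invariant location %d changes from %s to %s"
                % (addr, current[0], new_type)]

    end = addr + size
    for a in range(addr, end):
        entry = shadow[a]
        if entry is None:
            continue
        _, start, length = entry
        if start < addr or start + length > end:
            # The old value sticks out of the written range: its leftover
            # bytes lose their type (modeled here as per-byte unknowns).
            for b in range(start, start + length):
                if b < addr or b >= end:
                    shadow[b] = ("unk8", b, 1)

    for a in range(addr, end):
        shadow[a] = (new_type, addr, size)
    return []
```

Writing a two-byte value into the upper half of a four-byte integer, for example, leaves the lower two bytes marked unknown rather than as half of an int32.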

The Function Call Analysis Routine. When the interpreter invokes the 
analysis routine for the call instruction, the type checker first maps the target 
address of the call to the corresponding function and fetches its precomputed 
parameter and local variable template blocks. The type checker then compares 
the types of the arguments on the stack against the types in the parameter 
template block, reporting any mismatches. It also copies the type information 
for the local variable template block into the shadow memory at the appropriate 
offset from the frame pointer for the new function’s activation record. Local 
variables begin as uninitialized. Full function checking cannot be done if the
function has a variable number of arguments or uses a non-standard activation 
record, which may occur when a compiler employs certain optimizations, such 
as tail-call elimination. 
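
The call analysis can be sketched in a few lines. The dictionary-based function record, the frame representation, and the strict equality test on types are assumptions for this example; the sketch also ignores variadic functions and non-standard activation records, which the text excludes from full checking.

```python
def analyze_call(function, arg_types, frame_shadow):
    """Check a call against precomputed templates (illustrative sketch).

    function:     dict with 'params' (list of parameter types) and
                  'locals' (dict mapping frame offset -> type)
    arg_types:    types found on the stack for the actual arguments
    frame_shadow: dict mapping frame offset -> (type, initialized),
                  updated in place for the new activation record
    Returns a list of (position, expected, actual) mismatches.
    """
    mismatches = [
        (i, expected, actual)
        for i, (expected, actual) in enumerate(zip(function["params"], arg_types))
        if expected != actual
    ]
    # Copy the local variable template; locals start out uninitialized.
    for offset, local_type in function["locals"].items():
        frame_shadow[offset] = (local_type, False)
    return mismatches
```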

Reducing False Alarms. The initial version of our type checker reported false 
alarms on some common compiler idioms for the x86 architecture. For example, 
the gcc compiler may emit an xor instruction to clear a register containing 
a pointer. The compiler also uses the lea instruction to perform addition in 
certain cases. To avoid generating false alarms in situations like these, we relaxed 
the typing restrictions in the instruction type tables. In the case of xor, which 
originally used the same table as add above, we permitted two pointer operands, 
as long as they are the same storage location. For the lea instruction, we deviated 
from the table presented above by setting the result type to int32 if the result 
value is between negative one million and one million. Numbers in this range are 
much more likely to be integers than addresses. 
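
The two relaxations can be sketched as follows. Several details are assumptions made for illustration: the text does not state the result type assigned when xor is applied to a pointer and itself (int32 is assumed here, since the result is zero), nor whether the one-million bounds are inclusive (they are treated as exclusive). The function names and the `strict_lookup` callback are hypothetical.

```python
LEA_INTEGER_BOUND = 1000000  # "one million", from the text


def xor_result(src_type, dst_type, src_loc, dst_loc, strict_lookup):
    """Relaxed xor rule: two pointers are allowed if they are the same location.

    strict_lookup(src_type, dst_type) stands in for the unrelaxed table.
    The int32 result for the self-xor case is an assumption.
    """
    if src_type == "pointer" and dst_type == "pointer" and src_loc == dst_loc:
        return "int32"
    return strict_lookup(src_type, dst_type)


def lea_result(result_value):
    """Relaxed lea rule: small signed results are treated as integers."""
    if -LEA_INTEGER_BOUND < result_value < LEA_INTEGER_BOUND:
        return "int32"
    return "pointer"
```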

Another common source of false alarms is low-level library routines in libc. 
Handwritten assembly language implementing some of the string functions is 
particularly problematic because it performs integer operations on sequences of 
four bytes. We did not wish to relax the type rules to the point where these 
operations are accepted because it would weaken the checking too much. In- 
stead, we provide a way for the programmer to supply the type checker with a 
list of function names and specific lines of code for which no warnings should 
be reported. By default, warnings are turned off for the 15 most problematic
functions in libc, including memcpy, strlen, and tzset. Even though warnings
are not reported for these functions, they still update the shadow memory in the
expected way.
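
The suppression mechanism can be sketched as a filter applied after the shadow memory update. The default set below contains only the three functions named in the text (the remaining twelve are not listed in this paper); the function signature and the representation of a code location as a (file, line) pair are assumptions for illustration.

```python
# Only the three functions named in the text; the full default list of 15
# is not given there.
DEFAULT_SUPPRESSED_FUNCTIONS = frozenset({"memcpy", "strlen", "tzset"})


def handle_event(function_name, location, warning, shadow_update,
                 suppressed_functions=DEFAULT_SUPPRESSED_FUNCTIONS,
                 suppressed_lines=frozenset()):
    """Apply the shadow memory update, then decide whether to report.

    location is a hypothetical (file, line) pair; shadow_update is a
    callable performing the usual shadow memory write.
    Returns the warning to report, or None if there is nothing to report.
    """
    shadow_update()  # suppressed code still updates the shadow memory
    if warning is None:
        return None
    if function_name in suppressed_functions or location in suppressed_lines:
        return None
    return warning
```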

5 Evaluation 

Error Detection. We begin by describing our experiences applying the Hobbes 
type checker to the student projects from an undergraduate compilers class at 
Williams College. The assignments implement a compiler for a subset of the C 
language. Each assignment contains 3000-6000 lines of C code, plus a 3000 line 
parsing library, and they use the libc string, file, and memory routines.

Table 1. Errors found by the Hobbes type checker in a set of compiler assignments
from an undergraduate class at Williams College.

  Program   LOC     Unallocated   Uninitialized   Type Error   False Alarms
  p1        5,600   1             2               2            1
  p2        4,033   1             2               1            1
  p3        3,571   2             3               0            1
  p4        4,260   1             1               1            1
  p5        4,671   2             2               1            1

Table 1 summarizes the results of running each assignment on 15 sample
inputs. The table shows the number of accesses to unallocated or uninitialized 
memory, type errors, and false alarms reported by our tool. An error reported 
multiple times on different inputs is only counted once in the table. In addition, 
Hobbes suppresses duplicate warning messages and cascading warnings caused 
by an error reported earlier in a run. For example, if a program reads an unini- 
tialized value, later warnings on that memory location or the value that was read 
are not reported. 
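
One way to picture the suppression of duplicate and cascading warnings is the small filter below. The class, its method names, and the use of a "blamed" set that follows values as they are copied are assumptions for illustration; the paper does not describe how Hobbes implements this bookkeeping.

```python
class WarningFilter:
    """Suppress duplicate warnings and warnings caused by an earlier one."""

    def __init__(self):
        self.reported = set()  # (kind, code_location) pairs already shown
        self.blamed = set()    # data locations involved in an earlier warning

    def propagate(self, src, dst):
        """A value copied from src to dst carries the earlier blame with it."""
        if src in self.blamed:
            self.blamed.add(dst)

    def should_report(self, kind, code_location, data_location):
        key = (kind, code_location)
        if key in self.reported or data_location in self.blamed:
            return False
        self.reported.add(key)
        self.blamed.add(data_location)
        return True
```

With this filter, a read of an uninitialized location is reported once; later warnings about that location, or about a register that received its value, are dropped.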

The type checker reported memory errors in all five programs. The causes of 
these errors include calling free on the address of a global variable, accessing 
memory after it was deallocated, and incorrectly assuming that a routine in the 
parsing library initialized fields of a structure returned to the client. 

The type checker also caught a number of type errors in the programs. In 
p1, two type errors were found. First, due to incorrect pointer arithmetic, the
program overwrote an integer stored in memory with a pointer value. When that 
memory location was later read and multiplied by an integer, the type checker 
reported a type mismatch on the operands of the multiply instruction. Purify 
would not have caught this error since the bad pointer arithmetic would always 
yield a location allocated to the program. In addition, a function in p1 passed
a pointer as an argument to a function declared to take an int as a parameter. 
Inside the body of the function, the integer was cast back to a pointer and 
dereferenced. The type checker reported the mismatch between parameter and 
argument type. This code does not work properly on systems where an int is too 
small to hold a pointer, but the compiler did not warn of the problem because 
the programmer had not written a prototype for the function being called. A 
similar mistake was found in p5. The remaining two type errors, in p2 and p4, 
were caused by improper uses of unions similar to Example 1 in Section 2.

The type checker erroneously reported one additional type error in each of the 
five programs. Each program implemented a hashtable with pointer values for 
keys. In each program, the function to compute hash codes generated a warning 
because it performed arithmetic on a pointer. 

Clearly, false alarms posed no serious impediment to using the Hobbes type 
checker on the compiler assignments. To further explore the impact of false 
alarms on the utility of the Hobbes type checker, we also checked a number of 
larger, robust UNIX utilities. In general, the false alarm rate was acceptable. For
example, running ls with a number of different command line options netted a
total of 8 spurious warnings, all of which were caused by the use of system calls
not yet handled by Hobbes.* Several runs of grep generated spurious warnings,
but only about uses of memory management routines in the obstack library,
which implements a dynamic memory manager to be used in place of malloc
and free. Since the obstack routines affect the allocation status of memory, they
require special handling to be treated correctly by Hobbes and other run-time
analysis tools [4]. Even in programs with much higher false alarm rates, their
causes could usually be tracked to only a few problematic code sequences. Hobbes
reported approximately 60 and 300 spurious warnings for runs of vi and bash,
respectively. A large fraction of these false alarms are attributed to unhandled
system calls, hash functions, and a small number of other code sequences.

Table 2. Performance measurements for the SPECint 2000 benchmark. All times
are in seconds and are the average of three stable runs. The Ratio columns indicate
performance slowdowns relative to the Base Time.

               Base    Interpreter     Instrumented    MemCheck        TypeCheck
  Program      Time    Time    Ratio   Time    Ratio   Time    Ratio   Time     Ratio
  164.gzip      3.0    147.9    49     219.6    73     268.0    89      470.0   157
  175.vpr       3.1    279.6    90     334.1   108     385.1   124      532.5   172
  176.gcc       2.3     93.5    41     136.6    59     164.5    72      286.0   124
  181.mcf       0.4      8.6    22      12.0    30      15.0    38       21.0    53
  186.crafty    5.5    336.8    61     473.9    86     592.8   108     1030.3   187
  197.parser    5.4    178.2    33     257.3    48     311.4    58      486.1    90
  252.eon       3.9    297.5    76     366.5    94     483.1   124      681.9   175
  254.gap       1.4     52.9    38      76.5    55      97.8    70      160.6   115
  255.vortex    9.5    472.4    50     687.9    72     889.5    94     1396.6   147
  256.bzip2    13.9    587.6    42     791.3    57     930.4    67     1953.0   141
  300.twolf     0.4    15.75    39      21.7    59      25.7    64       44.2   111
  median                        42              59              72              141

We also verified that Hobbes could catch many classes of errors by running 
it on a test suite of programs with deliberate errors, all of which Hobbes found. 

Hobbes catches a number of errors earlier when code is compiled without 
optimizations. Without optimizations, all local variables (and many intermediate 
values) reside on the stack, where they are marked invariant. Thus, errors can be 
caught as soon as an invalid value is written to a variable. In contrast, optimized 
code uses registers, which are not marked invariant, more heavily. 

Performance. We applied the Hobbes checker to the SPECint 2000 benchmarks
to evaluate the performance of our tool. Table 2 shows execution times
and slowdowns for the interpreter and the interpreter instrumented with three
different tools: a tool with empty analysis routines, a memory checker similar to
Purify, and the type checker. All measurements are the average of three stable
runs on a dual-processor 1 GHz Pentium III machine with 1 GB of main memory
running the Red Hat Linux 2.4.9-smp kernel. We omit the 253.perlbmk benchmark
because it creates new processes to do most of the computation and does
not accurately reflect the impact of using the interpreter. The interpreter incurs
a median slowdown of 42 times over the base time. Most of this time is spent
decoding the x86 instruction stream. Installing empty analysis routines for all
opcodes increases the median slowdown from 42 to 59. The interpreter spends the
additional time storing registers and setting up activation records for the
analysis routines.

* Note that without the suppression techniques for code in libc described in the
previous chapter, this number would be higher.

MemCheck, a memory checker using a Purify-style checking algorithm, maintains
allocated and initialized bits for each byte of memory. Unlike Purify’s
approach, MemCheck also shadows the registers with similar information to catch
uses of uninitialized data. The memory checker increases the slowdown from 59
to 72. An additional slowdown of 13 relative to the base time is consistent with
our own experience using a tool for Alpha executables based on binary
modification and with reported measurements of Purify [3].

The Hobbes type checker runs roughly 140 times slower than normal execu- 
tion, versus a slowdown of 59 for the empty analysis routines. The type checker 
has not been optimized for speed, and there are several significant ways to
improve the performance of our prototype. Each instrumentation function typically
checks memory safety first and then type safety. While this separation of tasks 
keeps the implementation straightforward, the two steps duplicate a nontrivial 
amount of work. We believe that restructuring the code to eliminate this over- 
lap and further optimizing shadow memory operations will substantially improve 
performance. Additional improvements are also obtainable by switching from an 
interpreter to a binary translator and performing static analysis to reduce the 
number of instructions that must be instrumented. Finally, Hobbes is primarily 
a tool for testing a system, when performance is less important than correctness. 
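
The median slowdowns quoted in this section can be recomputed from the Ratio columns of Table 2. The short check below copies those columns from the table (in benchmark order, 164.gzip through 300.twolf); the variable names are illustrative.

```python
from statistics import median

# Ratio columns of Table 2, one value per benchmark in table order.
RATIOS = {
    "Interpreter":  [49, 90, 41, 22, 61, 33, 76, 38, 50, 42, 39],
    "Instrumented": [73, 108, 59, 30, 86, 48, 94, 55, 72, 57, 59],
    "MemCheck":     [89, 124, 72, 38, 108, 58, 124, 70, 94, 67, 64],
    "TypeCheck":    [157, 172, 124, 53, 187, 90, 175, 115, 147, 141, 111],
}

medians = {tool: median(values) for tool, values in RATIOS.items()}

# Extra slowdown of each checker over the empty analysis routines.
memcheck_overhead = medians["MemCheck"] - medians["Instrumented"]
typecheck_overhead = medians["TypeCheck"] - medians["Instrumented"]
```

The medians come out to 42, 59, 72, and 141, so the memory checker adds 13 and the type checker adds 82 on top of the empty analysis routines.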

6 Related Work 

Many projects have focused on identifying errors in C programs. We first describe
other dynamic tools, and then a few static tools that target low-level code. 

Purify [3], described earlier, was the first widely used memory access
checker. Hobbes tracks a superset of the information tracked in Purify’s shadow 
memory and is capable of identifying the same class of memory access errors. 
Memory errors that result from earlier type errors will be caught sooner in our 
system since Hobbes identifies them at the time of the type error. Valgrind is a 
more recent implementation of a Purify-like checker for Linux binaries m- 

Other memory access checkers change the representation of pointers in the 
target program to include capabilities | 15|6| l|4j. For example, Austin et al. pp 
extends the standard pointer representation to include a base and bounds for the 
block being referenced. Compiler-inserted code checks this extra information at 
each memory access. Such capability-based approaches can catch errors that Pu- 
rify (and Hobbes) miss, such as when illegal pointer arithmetic yields a reference 
to some valid piece of memory. However, they are not compatible with standard
compiled C code. Jones and Kelly [4] store pointer base and bounds information
separately, thereby achieving a higher degree of backward compatibility.

Patil and Fisher m demonstrated that it is sometimes possible to perform 
program checking in parallel with the target program execution. They present a 
memory access checker that incurs a slowdown as low as 10% by using a second 
processor to check the correctness of pointer operations. 

C-Cured [10] employs a type inference scheme to statically determine which
pointers in the target are used safely and which may be used improperly. Run- 
time checks are then inserted to check operations involving potentially unsafe 
pointers. C-Cured uses an extended pointer representation for these checks. This
combination of static and dynamic analysis prevents memory access errors and 
slows down most programs by less than a factor of two, but reliance on non- 
standard pointer representations limits its effectiveness in some situations. 

Loginov et al. [7] present a run-time type checker that uses a shadow memory
similar to ours. However, they use source-to-source translation to embed the 
checking and maintenance code into the target. Thus, they cannot effectively 
check or track types through functions in compiled libraries, and they handle 
only programs written entirely in C. These problems also exist in several other 
run-time type checkers, such as Saber C jS]. Their tool is faster than Hobbes 
because it instruments only source-level expressions, and not every assembly- 
language instruction. On standard benchmarks, their tool caused roughly a 50- 
fold slowdown. We believe that switching to binary translation for Hobbes would 
eliminate most of this performance difference. A reasonable balance between 
precision and performance could also be obtained by inserting source code checks 
wherever possible and binary code checks when source code is not available or 
external libraries are used. 

Several recent studies present static analysis techniques for C and assembly 
programs that would be very useful to incorporate into Hobbes. For example, 
Chandra and Reps devised physical type checking to check casts between different
structures [2]. They characterize safe casts and define structural subtyping
for C by considering the physical layout of structures. Their checking tool can 
successfully identify potentially unsafe casts in large programs m- Xu et al. ng 
focus on the related problem of inferring a valid typing for a compiled program 
to ensure type safety before executing it. They employ abstract interpretation 
to construct a static approximation of the types of registers and memory at each 
program point. In addition, Mycroft presents a way to reconstruct C structure 
declarations from their use in assembly code using type inference 1^. These last 
two techniques would be particularly useful for reconstructing type information 
in situations where it is not readily available to Hobbes. Morrisett et al. [8] 
present a type system for x86 assembly language, but it is very different than 
the one underlying the Hobbes type checker because it was designed to support 
compilation from a type-safe high-level language, and not from C.

7 Conclusions and Future Work

Program analysis tools to identify defects in code written in unsafe languages 
are necessary to improve the reliability of many software systems. The Hobbes 
type checker can identify a large class of type errors in such systems. While our 
initial experiments demonstrate the effectiveness of the Hobbes methodology, we 
would like to improve two key aspects of our system. 

Performance. Although the Hobbes interpreter provides a reasonable first 
prototype, implementing the type checker with a binary translation tool would 
significantly improve performance. Additional performance gains can also be 
obtained by eliminating the need to instrument every instruction. For example, 
static analysis could identify code fragments that are guaranteed to be type safe 
or that do not modify the program type state. 

Precision. We would like to incorporate type inference techniques similar to 
those of Mycroft [Sj and Xu et al. m to improve precision when full debugging 
information is unavailable. In addition, we believe that distinguishing different 
pointer types and identifying boundaries between structure fields and array ele- 
ments would allow the Hobbes type checker to find some classes of errors sooner 
than it currently does. We have designed extended type encodings for this
information, but we have not yet evaluated how best to use it.

Abstract. Popular mobile code architectures (Java and .NET) include
verifiers to check for memory safety and other security properties. Since 
their formats are relatively high level, supporting a wide range of source 
language features is awkward. Further compilation and optimization, 
necessary for efficiency, must be trusted. We describe the design and 
implementation of a fully type-preserving compiler for Java and ML. 
Its strongly-typed intermediate language provides a low-level abstract 
machine model and a type system general enough to prove the safety 
of a variety of implementation techniques. We show that precise type 
preservation is within reach for real-world Java systems. 



1 Introduction 

There is increasing interest in program distribution formats that can be checked 
for memory safety and other security properties. The Java Virtual Machine 
(JVM) [1] performs conservative analyses to determine whether the byte codes of
each method are safe to execute. Its class file format contains type signatures and
other symbolic information that makes verification possible. Likewise, the Common
Intermediate Language (CIL) of the Microsoft .NET platform [2] includes
type information and defines verification conditions for many of its instructions.

As a general distribution format, JVM class files are very high-level and quite 
partial to the Java language. The byte-code language (JVML) includes no
facilities for specifying data layouts or expressing the results of standard
optimizations. Compiling other languages for the JVM means making foreign
constructs look and act like Java classes or objects. That so many translations
exist [3] is
a testament to the utility of the mobile code concept, and to the ubiquity of 
the JVM itself. To some extent, CIL alleviates these problems. It supports user- 
defined value types, stack allocation, tail calls, and pointer arithmetic (which 
is outside the verifiable subset). Even so, a recent proposal to extend CIL for 
functional language interoperability [4] added no fewer than 6 new types and 12
new instructions (bringing the total number of call instructions to 5) and it still 
does not support ML’s higher-order modules or Haskell’s constructor classes. 

Another problem with both of these formats is that they require further com- 
pilation and optimization to run efficiently on real hardware. Since these phases 
occur after verification, they are not guaranteed to preserve the verified safety 
and security properties. Bugs in the compiler may have security implications, so 
the entire compiler must be trusted. 

The idea of type-preserving compilation is to remove the compiler from the 
trusted code base (TCB) by propagating type information through all the com- 
pilation and optimization passes. Every representation from the source down 
to the object code supports verification. Object formats developed in this
context include Typed Assembly Language (TAL) [5] and Proof-Carrying Code
(PCC) [6].

Many compilers, including Marmot [7], Intel’s VM [8], and NaturalBridge
Bullet Train [9], preserve some kind of type information in their intermediate
code, but none are rigorous enough to support verification. Lower-level code 
requires more sophisticated type systems. As we will demonstrate, annotations 
that merely distinguish between integers, floats, and objects of distinct classes 
are insufficient. Types must enforce subtle invariants, for which logical constructs 
(such as quantification) are useful. 

Our previous work [10, 11] developed type-theoretic encodings of many Java
features. We proved useful properties, such as type preservation and
decidability, but our goal was always to implement the encodings in a practical
compiler. In fact, we rejected the classic object encodings [12] because their
runtime penalties (superfluous indirections and function calls) were too high.

This paper describes the design and implementation of a compiler based on 
our encodings. It is the first practical system to use a higher-order polymorphic 
intermediate language to compile both functional and object-oriented source 
languages. Additionally, it has the following features: 

— Front ends for both Standard ML [13] and JVML that share optimizations
and code generators. Programs from either language run together in the 
same interactive runtime system. 

— λJVM, our high-level intermediate language (IL) in the Java front end, uses
the same primitive instructions and types as JVML, but is easier to verify
and more amenable to optimization (see section 3).

— JFlint, our low-level generic IL, includes function declarations, arrays and
structures, and the usual branches and numeric primitives. Its type system
includes logical quantifiers (universal, existential, fixed point) and rows [TT]
for abstracting over structure suffixes. The instruction stream includes
explicit type operations that guide the verifier.

— Unlike the CIL extension [4], our design supports a pleasing synergy between
the encodings of Java and ML. JFlint does not, for example, treat Java
classes or ML modules as primitives. Rather, it provides a low-level abstract
machine model and sophisticated types that are general enough to prove
the safety of a variety of implementation techniques. We expand on this
later in the paper.

— Nothing about our instruction set should surprise a typical compiler devel- 
oper. Type operations must appear periodically, but most occur in canned 
sequences that can easily be treated as macros. Although the detailed type 
information can be quite large, our graph representation maintains opti- 
mal sharing. Type annotations within the code are merely pointers into this 
graph. For debugging purposes, we print the type annotations using short,
intuitive names such as InstOf[java/lang/Object].

— All types are discarded after verification, leaving concise and efficient code, 
exactly as an untyped compiler would produce. 

Our thesis, in short, is that precise type preservation is within the reach of 
practical Java systems. 

The next section introduces a detailed example to elucidate some of the 
issues in certifying compilation of object-oriented languages, and to distinguish 
our approach from that of Cedilla Systems [15]. We postpone discussion of other
related projects to a later section.

2 Background: Self-Application and Special J

We begin by attempting to compile the most fundamental operation in object- 
oriented programming: virtual method invocation. 

public static void deviant (Object x, Object y)
{ x.toString(); }

The standard implementation adds an explicit self argument (this) to each 
method and collects the methods into a per-class structure called a vtable. Each
object contains a pointer to the vtable of the class that created it. To invoke 
a virtual method, we load the vtable pointer from the object, load the method 
pointer from the vtable, and then call the method, providing the object itself as 
this. 

public static void deviant (Object x, Object y)
{ if (x is null) throw NullPointerException;
  r1 = x.vtbl;
  r2 = r1.toString;
  call r2 (x); }

A certifying compiler must justify that the indirect call to r2 is safe; this is 
not at all obvious. Since x might be an instance of a subclass, the method in
r2 might require additional fields and methods that are unknown to the caller.
Self-application works thanks to a rather subtle invariant. One way to upset that
invariant is to select a method from one object and pass it another object as the
self argument. For example, replace just the last instruction above with
call r2 (y).

This might seem harmless; after all, both x and y are instances of Object. 
It is unsound, however, and any unsoundness can be exploited. Suppose class 
Int extends Object by adding an integer field; class Ref adds a byte vector and 
overrides toString:

class Ref extends Object
{ public byte[] vec;
  public String toString()
  { vec[13] = 0xFF; return "Ha ha!"; }
}

Then, calling the deviant method as follows: 
deviant (new Ref (...), new Int (...)); 

will jump to Ref.toString() with this bound to the Int object. Thus, we
use an arbitrary integer as an array pointer. This is one reason why virtual 
method calls are atomic operations in both JVML and CIL. How to enforce the 
self-application invariant in lower-level code is not widely understood. 
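
The broken invariant can be simulated with a small model. In the Python sketch below, an object is a two-slot list (vtable, field), which is an assumption standing in for the low-level object layout. In native code the deviant call would silently treat an integer as an array pointer; Python's dynamic checks surface the same confusion as a TypeError instead.

```python
def object_to_string(this):
    return "Object"


def ref_to_string(this):
    vec = this[1]    # Ref layout: slot 1 holds the byte vector
    vec[13] = 0xFF   # only valid if slot 1 really is a vector
    return "Ha ha!"


REF_VTABLE = {"toString": ref_to_string}
INT_VTABLE = {"toString": object_to_string}  # Int inherits Object.toString


def new_ref():
    return [REF_VTABLE, bytearray(16)]


def new_int(n):
    return [INT_VTABLE, n]


def invoke_virtual(x):
    """Correct dispatch: the method fetched from x receives x as self."""
    r1 = x[0]
    r2 = r1["toString"]
    return r2(x)


def deviant_call(x, y):
    """Broken dispatch: the method fetched from x receives y as self."""
    r1 = x[0]
    r2 = r1["toString"]
    return r2(y)
```

Calling `invoke_virtual(new_ref())` returns "Ha ha!" and writes into the vector, whereas `deviant_call(new_ref(), new_int(42))` runs Ref's method against an Int's integer field and fails.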

Cedilla Systems developed Special J [15], a proof-carrying code compiler for
Java. Their paper described the design, defined some of the predicates used in 
verification conditions, explained their approach to exceptional control flow, and 
gave some experimental results. Their running example was hand-optimized code 
including a loop, an array field, and an exception handler. 

Unfortunately, their paper did not adequately describe the safety conditions 
for virtual method calls. In communication with the authors, we discovered that 
their current system indeed does not properly enforce the necessary invariant 
on self-application [16]. It gives the type “vtable of Object” to r1 and the
type “implementation of String Object.toString()” to r2. The verification
condition for the call requires only that the static class of the self argument 
matches the static class of the object from which the method was fetched. As a 
result, the consumer’s proof checker will accept the malicious code given above. 

Necula claims that this hole can be patched [16], but it has still not been 
addressed in subsequent work. One weakness in the Cedilla PCC architecture 
is that the rules for the source language are part of the trusted code base. If they 
are unsound, all bets are off. Moreover, the rules and the code have different levels 
of granularity. PCC is machine code, but its logical predicates refer specifically 
to Java constructs such as objects, interfaces, and methods. To support another 
language, an entirely new set of language-specific predicates and rules must be 
added to the TCB. 

In the next section, we briefly survey the architecture of our compiler. Its 
key strongly-typed intermediate language is the topic of section 4. 

3 Architecture of Our Compiler 

Standard ML of New Jersey is an interactive runtime system and compiler based 
on a strongly-typed intermediate language called FLINT [18]. We extended the 
FLINT language of version 110.30 and implemented a new front end for Java 
class files. We updated the optimization phases to recognize the new features. 
The code generator and runtime system remain unchanged. 

The Java front end parses class files and converts them to a high-level IL 
called λJVM. This language uses the same primitive instructions and types as 
JVML. The difference is that λJVM replaces the implicit operand stack and 
untyped local variables with explicit data flow and fully-typed single-assignment 
bindings. This alternate representation has several advantages. First, it is 
simpler to verify than JVML, because all the hard analyses (object initialization, 
subroutines, etc.) are performed during translation and their results preserved 
in type annotations. The type checker for λJVM is just 260 lines of SML code. 
Second, as a functional IL, it is (like static single assignment form) amenable 
to further analysis and optimization. Although we have not implemented 
them, this phase would be suitable for class hierarchy analysis and various 
object-aware optimizations because the class hierarchy and method invocations are 
still explicit. 

We designed λJVM so that its control and data flow mimic that of JFlint. 
This means that the next phase of our compiler is simply an expansion of the 
JVML types and operations into more detailed types and lower-level code. 
Further details about λJVM are available elsewhere. 
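
To make the change of representation concrete, here is a small sketch of how stack code can be turned into single-assignment bindings. It is an illustration only, not the translation used by the compiler: the instruction set (iconst, iload, istore, iadd, imul, ireturn) is a tiny hypothetical subset written as strings, the code must be straight-line, and the class name StackToSsa and the generated names (arg0, t0, ...) are assumptions of the example.

```java
class StackToSsa {
    // Symbolically executes straight-line stack code. The operand stack and the
    // local variables hold *names*, so every computed value gets exactly one binding.
    static java.util.List<String> translate(String[] code, int numLocals) {
        java.util.List<String> out = new java.util.ArrayList<>();
        java.util.Deque<String> stack = new java.util.ArrayDeque<>();
        String[] locals = new String[numLocals];
        for (int i = 0; i < numLocals; i++) locals[i] = "arg" + i;
        int fresh = 0;
        for (String insn : code) {
            String[] p = insn.split(" ");
            switch (p[0]) {
                case "iconst": stack.push(p[1]); break;
                case "iload":  stack.push(locals[Integer.parseInt(p[1])]); break;
                // A store emits no code: the local simply aliases an existing name.
                case "istore": locals[Integer.parseInt(p[1])] = stack.pop(); break;
                case "iadd":
                case "imul": {
                    String right = stack.pop(), left = stack.pop();
                    String name = "t" + fresh++;
                    String op = p[0].equals("iadd") ? "+" : "*";
                    out.add("let " + name + " : int = " + left + " " + op + " " + right);
                    stack.push(name);
                    break;
                }
                case "ireturn": out.add("return " + stack.pop()); break;
                default: throw new IllegalArgumentException("unsupported: " + insn);
            }
        }
        return out;
    }
}
```

For the input iload 0, iload 1, iadd, istore 0, iload 0, iconst 2, imul, ireturn with two locals, the sketch yields the bindings let t0 : int = arg0 + arg1 and let t1 : int = t0 * 2, followed by return t1. The reuse of local 0 disappears, because each value now has its own typed name.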

On JFlint, we run several contraction optimizations (inlining, common subex- 
pression elimination, etc.), and type-check the code after each pass. Since method 
invocations are no longer atomic in JFlint, these optimizations readily lift and 
merge vtable accesses. A future version of the JFlint type system will even have 
support for optimizing array bounds checks. 

We discard the type information before converting to MLRISC [21] for final 
instruction selection and register allocation. To generate typed machine code, 
we would need to preserve types throughout the back end. The techniques of 
Morrisett et al. should apply directly, since JFlint is based on System F. 

Figure 1 demonstrates the SML/JFlint system in action. The top-level loop 
accepts Standard ML code, as usual. The JFlint subsystem is controlled via the 
Java structure; its members include: 

— Java.classPath : string list ref 

Initialized from the CLASSPATH environment variable, this is a list of direc- 
tories where the loader will look for class files. 

— Java.load : string -> unit 

looks up the named class using classPath, resolves and loads any depen- 
dencies, then compiles the byte codes and executes the class initializer. 

— Java.run : string -> string list -> unit 

ensures that the named class is loaded, then attempts to call its main method 
with the given arguments. 

Standard ML of New Jersey v110.30 [JFLINT 1.2] 
- Java.classPath := ["/home/league/r/java/tests"]; 
val it = () : unit 
- val main = Java.run "Hello"; 
[parsing Hello] 
[parsing java/lang/Object] 
[compiling java/lang/Object] 
[compiling Hello] 
[initializing java/lang/Object] 
[initializing Hello] 
val main = fn : string list -> unit 
- main ["Duke"]; 
Hello, Duke 
val it = () : unit 
- main []; 
uncaught exception ArrayIndexOutOfBounds 
  raised at: Hello.main([Ljava/lang/String;)V 
- 

Fig. 1. Compiling and running a Java program in SML/NJ. 

The session in figure 1 sets the classPath, loads the Hello class, and binds its 
main method, using partial application of Java.run. The method is then invoked 
twice with different arguments. The second invocation wrongly accesses argv[0]; 
this error surfaces as the ML exception Java.ArrayIndexOutOfBounds. 

This demonstration shows SML code interacting with a complete Java pro- 
gram. Since both run in the same runtime system, very fine-grained interactions 
should be possible. Benton and Kennedy [2H] designed extensions to SML to 
allow seamless interaction with Java code when both are compiled for the Java 
virtual machine. Their design should work quite well in our setting also. 

Ours is essentially a static Java compiler, as it does not handle dynamic 
class loading or the java.lang.reflect API. These features are more difficult 
to verify using a static type system, but they are topics of active research. The 
SML runtime system does not yet support kernel threads, so we have ignored 
concurrency and synchronization. 

Finally, our runtime system does not, for now, dynamically load native code. 
This is a dubious practice anyway; such code has free rein over the runtime 
system, thus nullifying any safety guarantees won by verifying pure code. Nev- 
ertheless, this restriction is unfortunate because it limits the set of existing Java 
libraries that we can use. 



4 Overview of the JFlint IL 

To introduce the JFlint language, we begin with a second look at virtual method 
invocation in Java: below is the expansion into JFlint of a Java method that takes 
Objects x and y and calls x.toString(). 

  obedient (x, y : InstOf[java/lang/Object]?) = 
    switch (x) 
      case null: throw NullPointerException; 
      case non-null x1: 
  .     <f1,m1; x2 : Self[java/lang/Object] f1 m1> = OPEN x1; 
  .     x3 = UNFOLD x2; 
        r1 = x3.vtbl; 
        r2 = r1.toString; 
        call r2 (x2); 

The dots at left indicate erasable type operations. The postfix ? indicates that 
the arguments could be null. The code contains the same operations as before: 
null check, two loads, and a call. The null check is expressed as a switch that, in 
the non-null case, binds the new identifier x1 to the value of x, but now with type 
InstOf[java/lang/Object] (losing the ?). It is customary to use new names 
whenever values change type, as this dramatically simplifies type checking. 



4.1 Type Operations 

The new instructions following the null check (OPEN and UNFOLD) are type oper- 
ations. InstOf abbreviates a particular existential type (we clarify the meanings 
of the various types in section 4.4): 

  InstOf[java/lang/Object] = 
    exists f0, m0: Self[java/lang/Object] f0 m0 

OPEN eliminates the existential by binding fresh type variables (f1 and m1 in the 
example) to the hidden witness types. Likewise, Self abbreviates a fixed point 
(recursive) type: 

  Self[java/lang/Object] fi mi = 
    fixpt s0: { vtbl : Meths[java/lang/Object] s0 mi; 
                hash : int; 
                fi } 

  Meths[java/lang/Object] sj mj = 
    { toString : sj -> InstOf[java/lang/String]; 
      hashCode : sj -> int; 
      mj(sj) } 

signature JFLINT = sig 
  datatype value                     (* identifiers and constants *) 
    = VAR of id | INT of Int32.int | STRING ... 

  datatype exp 
    = LETREC of fundec list * exp 
    | LET    of id * exp * exp 
    | CALL   of id * value list 
    | RETURN of value 
    | STRUCT of value list * id * exp 
    | LOAD   of value * int * id * exp 
    | STORE  of value * int * value * exp 
    (* type manipulation instructions *) 
    | INST   of id * ty list * id * exp 
    | FOLD   of value * ty * id * exp 
    | UNFOLD of value * id * exp 
    | PACK   of ty list * (value*ty) list * id * exp 
    | OPEN   of value * id list * (id*ty) list * exp 

  withtype fundec = id * (id * ty) list * exp 
end 

Fig. 2. Representation of JFlint code. 

UNFOLD eliminates the fixed point by replacing occurrences of the bound 
variable s0 with the recursive type itself. These operations leave us with a 
structural view of the object bound to x3; it is a pointer to a record of fields 
prefixed by the vtable (a pointer to a sequence of functions). Importantly, 
the fresh type variables introduced by the OPEN (f1 and m1) find their way 
into the types of the vtable functions. Specifically, r2 points to a function of 
type Self[java/lang/Object] f1 m1 -> InstOf[java/lang/String]. Thus 
the only valid self argument for r2 is x2. The malicious code of section 2 is 
rejected because opening y would introduce brand new type variables (f2 and 
m2, say); these never match the variables in the type of r2. The precise typing 
rules for UNFOLD and OPEN are available elsewhere [26]. 
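
Java generics offer a rough analogy for this argument, sketched below under stated assumptions. It is not JFlint: a wildcard type Self<?> plays the role of the existential InstOf, a call to a generic helper plays the role of OPEN (capture conversion names the hidden witness), and the single type parameter F stands in for both hidden rows. All class names (Fn, Vtbl, Self, OpenDemo) are hypothetical.

```java
// A function type, written out so that the sketch needs no imports.
interface Fn<A, B> { B apply(A a); }

// Counterpart of Meths: methods in the vtable expect a self argument of type S.
class Vtbl<S> {
    final Fn<S, String> toString;
    Vtbl(Fn<S, String> toString) { this.toString = toString; }
}

// Counterpart of Self: recursive through the vtable; F is the hidden field row.
class Self<F> {
    final Vtbl<Self<F>> vtbl;
    final F fields;
    Self(Vtbl<Self<F>> vtbl, F fields) { this.vtbl = vtbl; this.fields = fields; }
}

class OpenDemo {
    // Calling this helper on a Self<?> acts like OPEN: F names the hidden witness.
    static <F> String invoke(Self<F> x) {
        Fn<Self<F>, String> r2 = x.vtbl.toString;
        return r2.apply(x);          // the only acceptable self is a Self<F>
    }

    static String obedient(Self<?> x, Self<?> y) { return invoke(x); }

    // Rejected by javac, because y is opened with a different, unrelated witness:
    // static <F> String deviant(Self<F> x, Self<?> y) { return x.vtbl.toString.apply(y); }

    static Self<byte[]> newRef() {
        Vtbl<Self<byte[]>> vt =
            new Vtbl<>(s -> { s.fields[13] = (byte) 0xFF; return "Ha ha!"; });
        return new Self<>(vt, new byte[16]);
    }

    static Self<Integer> newInt(int n) {
        Vtbl<Self<Integer>> vt = new Vtbl<>(s -> "Int " + s.fields);
        return new Self<>(vt, Integer.valueOf(n));
    }
}
```

The commented-out deviant method is the point of the sketch: once the witness of x has a name, the method fetched from x can only be applied to x itself, which mirrors the way the fresh variables f1 and m1 rule out passing y.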

After the final verification, the type operations are completely discarded and 
the aliased identifiers are renamed. This erasure leaves us with precisely the same 
operational behavior that we used in an untyped setting. Like other instructions, 
type manipulations yield to simple optimizations. We can, for example, eliminate 
redundant OPENs and hoist loop-invariant UNFOLDs. In fact, using online common 
subexpression elimination, we avoid emitting redundant operations in the first 
place. For a series of method calls and field accesses on the same object, we would 
OPEN and UNFOLD it just once. Although the type operations have no runtime 
penalty, optimizing them is advantageous. First, fewer type operations means 
smaller programs and faster compilation and verification. Second, excess type 
operations often hide further optimization opportunities in runtime code. 



4.2 Code Representation 

Our examples use a pretty-printed surface syntax for JFlint. Figure 2 contains a 
portion of the SML signature for representing such code in our compiler. Iden- 
tifiers and constants comprise values. Instructions operate on values and bind 
their results to new names. Loads and stores on structures refer to the integer 
offset of the field. Function declarations have type annotations on the formal 
parameters. Non-escaping functions whose call sites are all in tail position are 
very lightweight, more akin to basic blocks than to functions in C. 

This language is closer to machine code than to JVML, but not quite as 
low-level as typed assembly language. Allocating and initializing a structure, 
for example, is one instruction: STRUCT. Similarly, the CALL instruction passes 
n arguments and transfers control all at once; the calling convention is not 
explicit. It is possible to break these down and still preserve verifiability, but 
this midpoint is simpler and still quite useful for optimization. 

There are two hurdles for a conventional compiler developer using a strongly- 
typed IL like JFlint. The first is simply the functional notation, but it can be 
understood by analogy to SSA. Moreover, it has additional benefits such as 
enforcing the dominator property and providing homes for type annotations. 
The second hurdle is the type operations themselves: knowing where to insert 
and how to optimize them. The latter is simple; most standard optimizations 
are trivially type-preserving. Type operations have uses and defs just like other 
instructions, and type variables behave (in most cases) like any other identifier. 

As for knowing what types to define and where in the code to insert the type 
operations: we developed recipes for Java primitives; some of these appear 
in figure 3. A thorough understanding of the type system is helpful for developing 
successful new recipes, but experimentation can be fruitful as long as the type 
checker is used as a safety net. Extending the type system without forfeiting 
soundness is, of course, a more delicate enterprise; a competent background in 
type theory and semantics is essential. 

4.3 Interfaces and Casts 

The open-unfold sequence used in method invocation appears whenever we need 
to access an object’s structure. Getting or setting a field starts the same way: 
null check, open, unfold (see the first expanded primop in figure 3). 

Previously, we showed the expansion of InstOf[C] as an existential type. 
Suppose D extends C; then, InstOf[D] is a different existential. In Java, any 
object of type D also has type C. To realize this property in JFlint, we use 
explicit type coercions. (This helps keep the type system simple; otherwise we 
would need F-bounded quantifiers with ‘top’ subtyping.) λJVM marks 
such coercions as upcasts. They are expanded into JFlint code just like other 
operators. 

An upcast should not require any runtime operations. Indeed, apart from the 
null test, the upcast recipe in figure 3 is nothing but type operations: open the 
object and repackage it to hide more of the fields and methods. Therefore, only 
the null test remains after type erasure: (x == null ? null : x). This is easily 
recognized and eliminated during code generation. 

In Java, casts from a class to an interface type are also implicit (assuming the 
class implements the interface). On method calls to objects of interface type, a 
compiler cannot statically know where to find the interface method. Most imple- 
mentations use a dynamic search through the vtable to locate either the method 
itself, or an embedded itable containing all the methods of a given interface. 
This search is expensive, so it pays to cache the results. With the addition of 
unordered (permutable) record types and a trusted primitive for the dynamic 
search, interface types pose no further problems. Verifying the searching and 
caching code in a static type system would be quite complex. As an experiment, 
we implemented a unique representation of interfaces for which the dynamic 
search is unnecessary. 

  putfield C.f (x : InstOf[C]?; y : T) 
    switch (x) case null: throw NullPointerException; 
               case non-null x1: 
  .              <f3,m3; x2 : Self[C] f3 m3> = OPEN x1; 
  .              x3 = UNFOLD x2; 
                 x3.f := y; 

  upcast D,C (x : InstOf[D]?) 
    switch (x) case null: return null : InstOf[C]?; 
               case non-null x1: 
  .              <f4,m4; x2 : Self[D] f4 m4> = OPEN x1; 
  .              x2 = PACK f5=NewFlds[D] f4, m5=NewMeths[D] m4 
                      WITH x1 : Self[C] f5 m5; 
                 return x2 : InstOf[C]?; 

  invokeinterface I.m (x : IfcObj[I]?; v1...vn) 
    switch (x) case null: throw NullPointerException; 
               case non-null x1: 
  .              <t; x1 : IfcPair[I] t> = OPEN x1; 
                 r1 = x1.itbl; 
                 r2 = x1.obj; 
                 r3 = r1.m; 
                 call r3 (r2, v1, ..., vn); 

Fig. 3. Recipes for some λJVM primitives. 

In our system, interface calls are about as cheap as virtual calls (null check, 
a few loads and an indirect call). We represent interface objects as a pair of the 
interface method table and the underlying object. To invoke a method, we fetch 
it from the itable and pass it the object as the self argument. This implies a 
non-trivial coercion when an object is upcast from a class to an interface type, 
or from one interface to another: fetch the itable and create the pair. Since all 
interface relationships are declared in Java, the itables can be created when each 
class is compiled, and then linked into the class vtable. Since the layout of the 
vtable is known at the point of upcast, dynamic search is unnecessary. 

The final recipe in figure 3 illustrates this technique. The new type abbrevi- 
ations for representing interface objects are, for example: 

  IfcObj[java/lang/Runnable] = 
    exists t . IfcPair[java/lang/Runnable] t 
  IfcPair[java/lang/Runnable] t = 
    { itbl : { run : t -> void }, obj : t } 

The existential hides the actual class of the object. Just as with virtual 
invocation, the interface invocation relies on a sophisticated invariant. A method 
from the itable must be given a compatible object as the self argument. The 
existential ensures that only the packaged object will be used with methods in 
the itable. 

signature JTYPE = sig 
  type ty 
  val var    : int * int -> ty              (* type variable *) 
  val arrow  : ty list * ty -> ty           (* function type *) 
  val struct : ty -> ty                     (* structure types *) 
  val row    : ty * ty -> ty 
  val empty  : int -> ty 
  val exists : kind list * ty list -> ty    (* quantified types *) 
  val fixpt  : kind list * ty list -> ty 
  val lam    : kind list * ty -> ty         (* higher-order *) 
  val app    : ty * ty list -> ty 
end 

Fig. 4. Abstract interface for JFlint type representation. 

This scheme also supports multiple inheritance of interfaces. Suppose inter- 
face AB extends both interfaces A (with method a) and B (with method b). The 
itable of AB will contain pointers to itables for each of the superinterfaces. To 
upcast from AB to B, just open the interface object, fetch itbl.B, pair it with 
obj, and re-package. 
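
The pair representation and the AB-to-B upcast can be mimicked with Java generics, as in the following sketch. It is an analogy under stated assumptions and not the JFlint encoding itself: the wildcard types PairB<?> and PairAB<?> stand for the existential IfcObj types, generic helper methods stand for OPEN, and all class names (Meth, ItblA, ItblB, ItblAB, PairB, PairAB, InterfaceDemo) are hypothetical.

```java
// An interface method expecting a self argument of the hidden class type T.
interface Meth<T> { String call(T self); }

class ItblA<T> { final Meth<T> a; ItblA(Meth<T> a) { this.a = a; } }
class ItblB<T> { final Meth<T> b; ItblB(Meth<T> b) { this.b = b; } }

// The itable of AB holds pointers to the itables of its superinterfaces.
class ItblAB<T> {
    final ItblA<T> A;
    final ItblB<T> B;
    ItblAB(ItblA<T> A, ItblB<T> B) { this.A = A; this.B = B; }
}

// Counterparts of IfcPair[B] t and IfcPair[AB] t: an itable paired with the object.
class PairB<T> {
    final ItblB<T> itbl; final T obj;
    PairB(ItblB<T> itbl, T obj) { this.itbl = itbl; this.obj = obj; }
}
class PairAB<T> {
    final ItblAB<T> itbl; final T obj;
    PairAB(ItblAB<T> itbl, T obj) { this.itbl = itbl; this.obj = obj; }
}

class InterfaceDemo {
    // Interface call: open the pair, fetch the method, pass obj as self. No search.
    static <T> String invokeB(PairB<T> x) { return x.itbl.b.call(x.obj); }

    // Upcast AB to B: open, fetch itbl.B, pair it with obj, and re-package.
    static <T> PairB<T> upcastToB(PairAB<T> x) { return new PairB<T>(x.itbl.B, x.obj); }

    // Packages a String as an AB object; callers no longer see the class String.
    static PairAB<?> sample(String name) {
        ItblA<String> ia = new ItblA<>(s -> "a(" + s + ")");
        ItblB<String> ib = new ItblB<>(s -> "b(" + s + ")");
        ItblAB<String> iab = new ItblAB<>(ia, ib);
        return new PairAB<String>(iab, name);
    }
}
```

With PairAB<?> ab = InterfaceDemo.sample("duke"), the expression InterfaceDemo.invokeB(InterfaceDemo.upcastToB(ab)) evaluates to "b(duke)". The type checker never learns the underlying class, yet it still guarantees that the method from the itable receives the object it was packaged with.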

Unfortunately, Java’s covariant subtyping of arrays (widely considered to 
be a misfeature) is not directly compatible with this interface representation. 
Imagine casting an array of class type to an array of interface type — we would 
need to coerce each element! For the purpose of experimentation, we ignored 
the covariant array subtyping rule. In the future, we would like to find a hybrid 
approach that allows cheap, easily verifiable invocation of interface methods, but 
is still compatible with the Java specification. 



4.4 Type Representation 

To support efficient compilation, types are represented differently from code. 
Figure 4 contains part of the abstract interface to our type system. Most of 
our types are standard: based on the higher-order polymorphic lambda calculus. 

A structure is a pointer to a sequence of fields, but we represent the sequence 
as a linked list of rows. Any tail of the list can be replaced with a type variable, 
providing a handle on suffixes of the structure. The InstOf definition used an 
existential quantifier to hide the types of additional fields and methods; 
these are rows. 

A universal quantifier — precisely the inverse — allows outsiders to provide 
types; in our encoding, it models inheritance. Subclasses provide new types for 
the additional fields and methods. Kinds classify types and provide bounds for 
quantified variables. They ensure that rows are composed properly by tracking 
the structure offset where each row begins [14]. 

Our object encodings rely only on standard constructs, so our type system is 
rooted in well-developed type theory and logic. The soundness proof for a similar 
system is a perennial assignment in our semantics course. The essence was even 
formalized in machine-checkable form using Twelf. 



4.5 Synergy 

Judging from the popular formats, it appears that there are just two ways to 
support different kinds of source languages in a single type-safe intermediate 
language. Either favor one language and make everyone else conform (JVM) 
or incorporate the union of all the requested features (CIL, ILX [2,4]). CIL 
instructions distinguish, for example, between loading functions vs. values from 
objects vs. classes. ILX adds instructions to load from closure environments and 
from algebraic data types. 

JFlint demonstrates a better approach: provide a low-level abstract machine 
model and general types capable of proving safety of various uses of the machine 
primitives. Structures in JFlint model Java objects, vtables, classes, and inter- 
faces, plus ML records and the value parts of modules. Neither Java nor ML has 
a universal quantifier, but it is useful for encoding both Java inheritance and 
ML polymorphism. The existential type is essential for object encoding but also 
for ML closures and abstract data types. 
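
The claim that one existential serves both objects and closures can be illustrated with a short sketch. Again this is a Java analogy and not JFlint code: a closure is modelled as a code pointer paired with an environment whose type is hidden by a wildcard, and the names Code, Closure and ClosureDemo are hypothetical.

```java
// Closed code: it receives its environment explicitly, like a self argument.
interface Code<E, A, R> { R run(E env, A arg); }

// Counterpart of: exists e. { code : (e, a) -> r, env : e }
class Closure<E, A, R> {
    final Code<E, A, R> code;
    final E env;
    Closure(Code<E, A, R> code, E env) { this.code = code; this.env = env; }
}

class ClosureDemo {
    // The generic helper opens the package: E names the hidden environment type.
    static <E, A, R> R applyOpened(Closure<E, A, R> c, A arg) {
        return c.code.run(c.env, arg);
    }

    // Callers see only Closure<?, A, R>, so they cannot tell environments apart.
    static <A, R> R apply(Closure<?, A, R> c, A arg) { return applyOpened(c, arg); }

    // Two closures with the same visible type but different environment types.
    static Closure<?, Integer, Integer> adder(int n) {
        Code<Integer, Integer, Integer> code = (env, x) -> env + x;
        return new Closure<Integer, Integer, Integer>(code, Integer.valueOf(n));
    }

    static Closure<?, Integer, Integer> lookup(int[] table) {
        Code<int[], Integer, Integer> code = (env, i) -> env[i];
        return new Closure<int[], Integer, Integer>(code, table);
    }
}
```

Both adder(3) and lookup(new int[] {10, 20, 30}) have the type Closure<?, Integer, Integer>, and apply works on either. The shape is the same as the interface pair: code that must be given its own packaged data.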

We believe this synergy speaks well of our approach in general. Still, it does 
not mean that we can support all type-safe source languages equally well. Java 
and ML still have much in common; they work well with precise generational 
garbage collection and their exceptions are similar enough. Weakly typed for- 
mats, such as C-- [32], are more ambitious in supporting a wider variety of 
language features, including different exception and memory models. Practical 
type systems to support that level of flexibility are challenging; further research 
is needed. 



5 Implementation Concerns 

If a type-preserving compiler is to scale, types and type operations must be 
implemented with extreme care. The techniques of Shao et al. made the FLINT 
typed IL practical enough to use in a production compiler. Although different 
type structures arise in our Java encodings, the techniques are quite successful. A 
full type-preserving compile of the 12 classes in the CaffeineMark 3.0 embedded 
series takes 2.4 seconds on a 927 MHz Intel Pentium III Linux workstation. This 
is about 60% more than gcj, the GNU Java compiler [33]. Since gcj is written 
in C and our compiler in SML, this performance gap can easily be attributed to 
linguistic differences. Verifying both the λJVM and the JFlint code adds another 
half second. 
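
As a rough worked reading of these figures (an interpretation, assuming that "about 60% more" means our compile time is 1.6 times that of gcj, and that the half second of verification is added on top of the 2.4 seconds):

```latex
\[
t_{\mathrm{gcj}} \approx \frac{2.4\,\mathrm{s}}{1.6} = 1.5\,\mathrm{s},
\qquad
t_{\mathrm{compile+verify}} \approx 2.4\,\mathrm{s} + 0.5\,\mathrm{s} = 2.9\,\mathrm{s},
\qquad
\frac{0.5\,\mathrm{s}}{2.4\,\mathrm{s}} \approx 0.21 .
\]
```

Under these assumptions, verification costs roughly a fifth of the compile time.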

Run times are promising, but can be improved. (Our goal, of course, is to 
preserve type safety; speed is secondary.) CaffeineMark runs at about a third the 
speed in SML/NJ compared to gcj -O2. There are several reasons for this dif- 
ference. First, many standard optimizations, especially on loops, have not been 
implemented in JFlint yet. Second, the code generator is still heavily tuned for 
SML; structure representations, for example, are more boxed than they should 
be. Finally, the runtime system is also tuned for SML; to support callcc, every 
activation record is heap-allocated and subject to garbage collection. Benchmark- 
ing is always fraught with peril. In our case, meaningful results are especially 
elusive because we can only compare with compilers that differ in many ways 
besides type preservation. 



6 Related Work 

Throughout the paper, we made comparisons to the Common Intermediate Lan- 
guage (CIL) of the Microsoft .NET platform and ILX, a proposed extension 
for functional language interoperability [4]. We discussed the proof-carrying code 
system Special J at length in section 2. We mentioned C-- [32], the portable 
assembly language, in section 4.5. Several other systems warrant mention. 

Benton et al. built MLj, an SML compiler targeting the Java Virtual Ma- 
chine; we mentioned their extensions for interoperability earlier. Since 
JVML is less expressive than JFlint, they monomorphize SML polymorphic func- 
tions and functors. On some applications, this increases code size dramatically. 
JVML is less appropriate as an intermediate format for functional languages be- 
cause it does not model their type systems well. Polymorphic code must either 
be duplicated or casts must be inserted. JFlint, on the other hand, completely 
models the type system of SML. 

Wright et al. compile a Java subset to a typed intermediate language, 
but they use unordered records and resort to dynamic type checks because their 
system is too weak to type self-application. Neal Glew translates a simple 
class-based object calculus into an intermediate language with F-bounded poly- 
morphism and a special ‘self’ quantifier. A more detailed comparison with 
this encoding is available elsewhere. 

Many researchers use techniques reminiscent of those in our λJVM trans- 
lation format. Marmot converts bytecode to a conventional high-level IL using 
abstract interpretation and type elaboration. Gagnon et al. give an 
algorithm to infer static types for local variables in JVML. Since they do not 
use a single-assignment form, they must occasionally split variables into their 
separate uses. Since they do not support set types, they insert explicit type 
casts to solve the multiple interface problem. Amme et al. translate Java to 
SafeTSA, an alternative mobile code representation based on SSA form. Since 
they start with Java, they avoid the complications of subroutines and set types. 
Basic blocks must be split wherever exceptions can occur, and control-flow edges 
are added to the catch and finally blocks. Otherwise, SafeTSA is similar in 
spirit to λJVM. 

7 Conclusion 

We have described the design and implementation of our type-preserving com- 
piler for both Java and SML. Its strongly-typed intermediate language provides 
a low-level abstract machine model and a type system general enough to prove 
the safety of a variety of implementation techniques. This approach produces a 
pleasing synergy between the encodings of both languages. We have shown that 
type operations can be implemented efficiently and do not preclude optimiza- 
tions or efficient execution. We therefore believe that precise type preservation 
is within reach for real-world Java systems. 

1 Introduction 

MAGICA (MAthematica system for General-purpose Inferring and Compile-time 
Analyses) is an extensible inference engine that can determine the types (value 
range, intrinsic type and array shape) of expressions in a MATLAB program. 
Written as a Mathematica application, it is designed as an add-on module that 
any MATLAB compiler infrastructure can use to obtain high-quality type infer- 
ences. 



1.1 A Type Inference Using MAGICA 

Lines In[1] and Out[1] below demonstrate a simple interaction with MAGICA 
through a notebook interface.^1 On line In[1], the MAGICA type function ob- 
ject is applied to a representation of the MATLAB expression tanh(3.78i). 
MAGICA’s response, shown on Out[1], is the inferred type of tanh(3.78i). 
In this case, “type” is the expression {v, i, s} where v, i and s are the value 
range, intrinsic type and array shape of tanh(3.78i); Out[1] indicates these 
to be the point 0.742071 i, the $nonreal intrinsic type designator, and the two- 
dimensional array shape with unit extents along both dimensions (that is, the 
scalar shape). 

In[1]:= type[tanh[3.78i]] 

Out[1]= {0.742071 i, $nonreal, {<1, 1>, 2}} 

1.2 Feature Support 

The above is an example of a type inference on a single MATLAB expression. 
MAGICA can infer the types of whole MATLAB programs comprising an ar- 
bitrary number of user-defined functions, each having an arbitrary number of 
statements. User-defined functions can return multiple values, and can consist of 
assignment statements, the for and while loops, and the if conditional state- 
ment. (All these MATLAB constructs are explained elsewhere.) In addition, MAGICA 
can handle close to 70 built-in functions in MATLAB. These include important 
Type II operations^2 like subsref, subsasgn and colon that are used in array 
indexing and colon expressions. For the most part, the full or nearly the full 
semantics of a built-in function, as specified in [3], is supported. For instance, 
subscripts in array indexing expressions can themselves be arrays, and arrays 
can be complex-valued. Not all of MATLAB’s features are currently handled; 
the exceptions include structures, cell arrays and recent additions like function 
handles. 

This research was supported by DARPA under Contract E30602-98-2-0144, and 
by NASA under Contract 276685/NAS5-00212. Mathematica® fonts by Wolfram 
Research, Inc. 

^1 The outputs in this paper can be exactly reproduced by typing the code shown 
against each In[n]:= prompt into a notebook interface to version 1.0 of MAGICA, 
running on Mathematica 4.1. 

2 Representing MATLAB in MAGICA 

MAGICA symbolically represents constructs in MATLAB. An example of this is 
the Mathematica expression plus[a, b], which is MAGICA’s representation of 
the MATLAB expression a+b. On line In[1] above, the Mathematica expression 
tanh[3.78i] was used to denote the MATLAB expression tanh(3.78i). The 
idea of functionally representing a MATLAB expression can also be used to 
denote high-level constructs. For instance, the MATLAB assignment statement 
l = cos(3.099), where l is a MATLAB program variable, is represented in 
MAGICA as shown on line In[2] below. 

In[2]:= assignment[$$lhs :> l, $$rhs :> cos[3.099]] 

Out[2]= assignment($$lhs :> l, $$rhs :> cos(3.099)) 

The expression’s head is assignment and this is used to uniquely identify MAT- 
LAB assignments. The tags $$lhs and $$rhs serve to identify the assignment’s 
left-hand side and right-hand side. We call l and cos[3.099] tag values. A 
tag value can be any expression; this allows for the representation of arbitrary 
MATLAB assignments, including the multiple-value assignment. 

In general, MATLAB statements are represented in MAGICA as 

$h[x_1 :> y_1,\; x_2 :> y_2,\; \ldots,\; x_n :> y_n]$ 

where the head $h$ serves as a construct identifier, and where the delayed rules 
$x_i :> y_i$ ($1 \le i \le n$) stand for tag-value pairs. MAGICA places no signif- 
icance on the position of a tag-value pair; this point should be kept in mind 
when making new definitions to extend the MAGICA system. A fair amount of 
documentation regarding data structure layouts has been coded into MAGICA 
itself as usage messages [2]; this provides a convenient, on-line way of pulling 
up layout information while interacting with MAGICA. 
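
The position-independence of tag-value pairs can be pictured with a small data structure. The sketch below is an analogy written in Java, not MAGICA code (MAGICA is a Mathematica application): the class Construct and its methods are hypothetical, and tag values are kept as plain strings for brevity.

```java
// A construct is a head plus a set of tag-value pairs; the order of pairs is irrelevant.
class Construct {
    final String head;
    // A sorted map keeps printing deterministic and makes equality ignore insertion order.
    private final java.util.Map<String, Object> tags = new java.util.TreeMap<>();

    Construct(String head) { this.head = head; }

    Construct with(String tag, Object value) { tags.put(tag, value); return this; }

    Object get(String tag) { return tags.get(tag); }

    @Override public boolean equals(Object o) {
        return o instanceof Construct
            && head.equals(((Construct) o).head)
            && tags.equals(((Construct) o).tags);
    }

    @Override public int hashCode() { return 31 * head.hashCode() + tags.hashCode(); }

    // Prints in the h[x1 :> y1, ...] style used in the text, with tags in sorted order.
    @Override public String toString() {
        StringBuilder sb = new StringBuilder(head).append('[');
        String sep = "";
        for (java.util.Map.Entry<String, Object> e : tags.entrySet()) {
            sb.append(sep).append(e.getKey()).append(" :> ").append(e.getValue());
            sep = ", ";
        }
        return sb.append(']').toString();
    }
}
```

Building the assignment with $$rhs first and $$lhs second gives an object equal to the one built in the opposite order, and both print as assignment[$$lhs :> l, $$rhs :> cos[3.099]]. New definitions that look tags up by name, and never by position, stay correct under either ordering.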

^2 MATLAB’s built-in functions can be classified into one of three groups, based on 
how the shapes of the outputs are dependent on the shapes of the inputs. Type 
I built-ins produce outputs whose shapes are completely determined by the shapes 
of the arguments, if any. Type II built-ins produce an output whose shape is also 
dependent on the elemental values of at least one input. All remaining built-ins fall 
into the Type III group. 

In[3]:= ?if 



if[$$condition :> c_, $$then :> t_, $$else :> e_] 
is the functional equivalent of an if statement in MATLAB. Forms such as 
$$condition -> c, $$then -> t and $$else -> e can also be used. 



3 Transitive Closure of a Graph 

The two boxes in Figure 1 display a complete MATLAB program that computes 
the transitive closure of a graph. The graph is represented in the N×N adjacency 
matrix A, which is initialized arbitrarily in the function tclosure. Its transitive 
closure is returned in B. The shown code is directly from Alexey Malishevsky’s 
thesis [2], with three nontrivial changes: (1) the tic and toc timing commands 
were removed, (2) disp was used to display B, and (3) the original monolithic 
script was reorganized into two files, one containing the function driver and the 
other containing tclosure. 



3.1 M-File Contexts 

Input files that constitute a MATLAB program are referred to as M-files in 
MATLAB parlance. Every M-file has its own parsed representation in MAGICA, 
which we call an M-file context. Through Mathematica’s information-hiding con- 
text mechanism [1], MAGICA provides a way to save, and later retrieve, the M-file 
contexts of a MATLAB program. On line In[4] below, the M-file contexts of 
the two user-defined functions driver and tclosure, saved in an earlier session 
of MAGICA, are loaded from disk.^3 

In[4]:= Scan[load[#, load$Disk -> True]&, {"tclosure`", "driver`"}] 

An M-file context is basically a collection of Mathematica definitions that cap- 
ture information about a user-defined MATLAB function. As an example, for a 
user-defined function /, a definition is made against the statements function 
object so that statements [/] expands to the function body of /. This is how 
the type object operates on the statements in the body of driver on line In [5] 
below □ 



In[5]:= type[statements[driver]] // Timing

Out[5]= {5.47 Second, {_12N1 -> {512., $integer, {<1, 1>, 2}},
                       _10B1 -> {[0, 1], $boolean, {<512, 512>, 2}},
                       {Indeterminate, $illegal, {<-1, 1>, 2}}}}



^ These representations are in ASCII, and can be manually or automatically generated.
^ The timings are on a 440 MHz Solaris 7 UltraSPARC-IIi having 128MB of main
memory.

function driver
N = 512;
B = tclosure(N);
disp(B);

function B = tclosure(N)
% Initialization.
A = zeros(N, N);
for ii = 1:N,
    for jj = 1:N,
        if ii*jj < N/2,
            A(N-ii, ii+jj) = 1;
            A(ii, N-ii-jj) = 1;
        end;
        if ii == jj,
            A(ii, jj) = 1;
        end;
    end;
end;
B = A;
% Closure.
ii = N/2;
while ii >= 1,
    B = B*B;
    ii = ii/2;
end;
B = B > 0;



Fig. 1. The MATLAB Transitive Closure Program 



3.2 Interprocedural Type Inference 

The definitions against the type object (currently over 100) take care of
propagating information across user-defined function interfaces. Thus the application
of type on line In[5] causes type information pertaining to N to be propagated
into tclosure, resulting in the shown type inference for its output variable B.
On line Out[5], _12N1 stands for N and _10B1 for B; this renaming is an artifact
of the way in which the Mathematica representations for this program were
automatically generated. Out[5] thus shows that the value range of B is [0, 1], its
intrinsic type is $boolean, and its shape is 512 x 512. The third inference
on Out[5] represents the type of disp's outcome; the shown values reflect the
fact that disp doesn't return anything.
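The inference for B described above can be mimicked with a very small abstract interpreter. The following Python sketch is only an illustration under stated assumptions: the names Ty, const, zeros, store, matmul, gt_zero and infer_tclosure are hypothetical, the rules are simplified interval and shape rules, and nothing here reflects how MAGICA itself is implemented. Array stores are assumed to stay within bounds, so they widen the value range without changing the shape.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Ty:
    """An inferred type: value range [lo, hi], intrinsic type, and shape."""
    lo: int
    hi: int
    intrinsic: str
    shape: Tuple[int, int]


def const(v: int) -> Ty:
    # An integer literal is a 1 x 1 array with a degenerate value range.
    return Ty(v, v, "integer", (1, 1))


def zeros(n: Ty, m: Ty) -> Ty:
    # The output shape depends on the *values* of the arguments.
    if n.lo != n.hi or m.lo != m.hi:
        raise ValueError("this sketch only handles constant extents")
    return Ty(0, 0, "integer", (n.lo, m.lo))


def store(a: Ty, v: Ty) -> Ty:
    # A(i, j) = v, assumed in bounds: the range widens, the shape is kept.
    return Ty(min(a.lo, v.lo), max(a.hi, v.hi), a.intrinsic, a.shape)


def matmul(a: Ty, b: Ty) -> Ty:
    if a.shape[1] != b.shape[0]:
        raise ValueError("inner dimensions disagree")
    k = a.shape[1]
    prods = [a.lo * b.lo, a.lo * b.hi, a.hi * b.lo, a.hi * b.hi]
    # Each output element is a sum of k products.
    return Ty(k * min(prods), k * max(prods), "integer",
              (a.shape[0], b.shape[1]))


def gt_zero(a: Ty) -> Ty:
    # A comparison yields a boolean array of the same shape.
    return Ty(0, 1, "boolean", a.shape)


def infer_tclosure(n: Ty) -> Ty:
    """Abstractly run tclosure from Fig. 1 for a constant N."""
    if n.lo != n.hi:
        raise ValueError("this sketch only handles a constant N")
    a = store(zeros(n, n), const(1))  # covers all three A(...) = 1 stores
    b = a
    ii = n.lo // 2
    while ii >= 1:                    # same trip count as the MATLAB loop
        b = matmul(b, b)
        ii //= 2
    return gt_zero(b)


result = infer_tclosure(const(512))
print(result)
```

Under these assumptions the result for N = 512 has value range [0, 1], intrinsic type boolean and shape 512 x 512, which agrees with the inference for B reported on Out[5].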

4 Architecture 

MAGICA is used through a front-end, which is a separate operating system
process that builds a Mathematica representation of an input MATLAB program.
The front-end transfers the representation to MAGICA for type analysis; the type
inferences that MAGICA generates are transferred back, for use in type-related
optimizations, code generation or simply for code annotation and visualization.
Exchanges between the front-end and MAGICA happen across an interprocess
communication link using the MathLink protocol [4]. Figure 2 shows three
existing front-ends to MAGICA. Two of these, the GUI-based notebook and the
text-based interface, are shipped with Mathematica. Interacting with MAGICA
using them requires either the handcrafting of the program representations that
are to be type inferred, or the availability of those representations on disk. (In[1]
and In[2] are examples of handcrafted representations; In[4] uses prefabricated
representations.) The third front-end is a custom-built one
that takes a MATLAB program in its native form and translates it to optimized
C; it uses MAGICA as the inference engine to obtain the necessary type
information. In fact, it was by using this front-end that the M-file contexts for the
example in Section 3 were produced in advance. Figure 2 also shows the disk image
of a sample M-file context.

$$rhs[mfile`Namespace`MATLAB`$ls] ^:= sqrt[2]
inputArgs[$$main$$] ^:= {}
outputArgs[$$main$$] ^:= {}
statements[$$main$$] ^:= Sequence[assignment[
        $$lhs :> mfile`Namespace`MATLAB`$ls,
        $$rhs :> sqrt[2]],
    disp[mfile`Namespace`MATLAB`$ls]]
variables[$$main$$] ^:= {mfile`Namespace`MATLAB`$ls}
$mfile := {$$main$$}



Fig. 2. The MAGICA Architecture 



5 Summary 

This paper briefly introduced a software tool called MAGICA that infers value 
ranges, intrinsic types and array shapes for the MATLAB programming lan- 
guage. Though shown in an interactive mode in this paper, MAGICA can also be 
used in a batch mode from a custom front-end. Currently, MAGICA is being used 
this way by a MATLAB-to-C translator that converts a MATLAB source
to optimized C code.



Abstract. Reference analyses for object-oriented programming languages have a
history of approximately ten years. Approaches vary as to how different
analyses account for program execution flow, how they capture calling context, 
and how they model objects, reference variables and the possible calling structure 
of the program. A taxonomy of analysis dimensions that affect precision (and 
cost) will be presented and illustrated by examples of existing reference analysis 
techniques. 



1 Introduction 

Almost 25 years after the introduction of Smalltalk-80, object-orientation is a mature, ac- 
cepted technology. Therefore, it is appropriate now to take a historical look at analyses 
for object-oriented programming languages, examining how they have evolved, par- 
ticularly with respect to ensuring sufficient precision, while preserving practical cost. 
Object-oriented languages allow the building of software from parts, encouraging code 
reuse and encapsulation through the mechanisms of inheritance and polymorphism. 
Commonly, object-oriented languages also allow dynamic binding of method calls, dy- 
namic loading of new classes, and querying of program semantics at runtime using 
reflection. 

To understand the control flow in an object-oriented program requires knowledge of 
the types of objects which can act as receivers for dynamic method dispatches. Thus, to 
know the possible calling structure in a program, the set of possible object types must 
be known; but to determine the set of possible types of objects, some representation of 
possible interprocedural calling structure must be used. Essentially the program repre- 
sentation (i.e., the calling structure) is dependent on the analysis solution and vice versa. 
This interdependent relationship makes analysis of object-oriented languages quite
different from that of procedural languages. In addition, dynamic class loading may
require a runtime recalculation of some analysis results [42].

Therefore, there is a fundamental need for reference analysis in any analysis of object- 
oriented languages, in order to obtain a program representation. The term reference 
analysis is used to define an analysis that seeks to determine information about the 
set of objects to which a reference variable or field may point during execution. This 
study will discuss the dimensions of reference analysis which lead to variations in the
precision obtained in its solution. Examining these dimensions will illustrate similarities
and differences between analyses, and identify sources of precision and tradeoffs in cost. 
Examples of these dimensions will be discussed in the context of different analyses. Open 
issues not yet fully addressed will also be discussed. 

Optimizing compilers and program development tools, such as test harnesses, refac- 
toring tools, semantic browsers for program understanding, and change impact analysis 
tools, use reference analysis and its client analyses (e.g., side effect analysis, escape 
analysis, def-use analysis). There are real tradeoffs between the usability of the analy- 
sis results in terms of precision and the cost of obtaining them, the time and memory 
required. These tradeoffs are especially significant for interactive tools. It is important, 
therefore, to validate analyses by measures corresponding to their eventual use in client 
applications, even if a full application is not built. Use of benchmark suites which allow 
evaluation of different techniques using the same input data-sets is crucial; more efforts 
for building such suites should be encouraged by the research community. 

This study is not an attempt at an encyclopedic categorization of all analyses of object- 
oriented languages; rather the goal is to enumerate characteristics which differentiate 
the precision (and affect the cost) of different analyses and to give examples of different 
design choices in existing analyses. There are other papers which cover many of the
existing reference analyses and compare and contrast them [18,28]. This paper, by design,
will be incomplete in the set of analyses mentioned.

Overview. Section 2 presents the dimensions of precision to be discussed and explains
them intuitively. Section 3 discusses each dimension more fully, cites reference analysis
examples of choices with respect to that dimension, and then discusses the relative
influence of that dimension on reference analysis precision (and cost). Section 4 presents
some open issues with regard to analysis of object-oriented programs. Finally, Section 5
summarizes these discussions.



2 Preliminaries 

Recall that reference analysis determines information about the set of objects to which 
a reference variable or reference field may point during execution. Historically, various 
kinds of reference analyses have been developed. Class analysis usually involves calcu- 
lation of the set of classes (i.e., types) associated with the objects to which a reference 
variable can refer during program execution; this information has been used commonly 
for call graph construction. Intuitively, class analysis can be thought of as a reference 
analysis in which one abstract object represents all the instantiations of a class. Points-to
analysis of object-oriented languages is a term used often for analyses that distinguish
different instantiations of a class (i.e., different objects). Points-to analyses [23,33] are
often designed as extensions to earlier pointer analyses for C. Refers-to
analysis [45] is a term sometimes used to distinguish a points-to analysis for object-oriented
languages from a points-to analysis for general-purpose pointers in C. The term
reference analysis will be used to denote all of these analyses for the remainder of this
paper.

Most of the analyses used here as examples are reference analyses which are
fundamental to understanding the semantics of object-oriented programs. Recall from
Section 1 that the interprocedural control flow of an object-oriented program cannot be
known without the results of these analyses. Thus, other analyses (including side effect,
escape, def-use, and redundant synchronization analyses) require a reference analysis
in order to obtain a representation of interprocedural flow for a program. Reference
analyses are therefore crucial to any analysis of object-oriented code.

B.G. Ryder

The characteristics or dimensions that directly affect reference analysis precision are
presented below. The design of a specific analysis can be described by choices in each
of these dimensions. After the brief description here, in Section 3 each dimension and
the possible choices it offers will be illustrated in the context of existing analyses.

- Flow sensitivity. Informally, if an analysis is flow-sensitive, then it takes into
account the order of execution of statements in a program; otherwise, the analysis is
called flow-insensitive. Flow-sensitive analyses perform strong updates (or kills);
for example, this occurs when a definition of a variable supersedes a previous
definition. The classical dataflow analyses [2,25,19] are flow-sensitive, as are classical
abstract interpretations.

- Context sensitivity. Informally, if an analysis distinguishes between different
calling contexts of a method, then it is context-sensitive; otherwise, the analysis is called
context-insensitive. Classically, there are two approaches for embedding context
sensitivity in an analysis, a call string approach and a functional approach [38]. Call
strings refer to using the top sequence on the call stack to distinguish the
interprocedural context of dataflow information; the idea is that dataflow information tagged
with consistent call strings corresponds to the same calling context (which is
being distinguished). The functional approach involves embedding information about
program state at the call site, and using that to distinguish calls from one another.

- Program representation (i.e., calling structure). Because of the interdependence 
between possible program calling structure and reference analysis solution in object- 
oriented languages, there are two approaches to constructing an interprocedural 
representation for an object-oriented program. A simple analysis can obtain an ap- 
proximation of the calling structure to be used by the subsequent reference analysis. 
Sometimes this representation is then updated using the analysis solution, when cer- 
tain edges have been shown to be infeasible. Alternatively, the possible call structure 
can be calculated lazily, on-the-fly, interleaved with reference analysis steps. The 
latter approach only includes those methods in the call graph which are reachable 
from program start according to the analysis solution. 

- Object representation. This dimension concerns the elements in the analysis solu- 
tion. Sometimes one abstract object is used to represent all instantiations of a class. 
Sometimes a representative of each creation site (e.g., new) is used to represent all 
objects created at that site. These two naming schemes are those most often used, 
although alternatives exist. 

- Field sensitivity. An object or an abstract object may have its fields represented
distinctly in the solution; this is called a field-sensitive analysis. If the fields in an
object are indistinguishable with respect to what they reference, then the analysis is
termed field-insensitive.



- Reference representation. This dimension concerns whether each reference
representative corresponds to a unique reference variable or to groups of references, and
whether the representative is associated with the entire program or with sections of
the program (e.g., a method). This dimension relates to reference variables as object
representation relates to objects.

- Directionality. Generally, flow-insensitive analyses treat an assignment x = y as
directional, meaning information flows from y to x, or alternatively as symmetric,
meaning that subsequent to the assignment, the same information is associated with x
and y. These approaches can be formulated in terms of constraints, which are
unification (i.e., equality) constraints for symmetric analyses or inclusion (i.e., subset)
constraints for directional analyses.

^ Sometimes reference analysis is performed interleaved with the client analysis, for example in
[1,5,49,6].

By varying analysis algorithm design in each of these dimensions, it is possible to 
affect the precision of the resulting solution. The key for any application is to select an 
effective set of choices that provide sufficient precision at practical cost. 



3 Dimensions of Analysis Precision 

There is much in common in the design of pointer analysis for C programs and some
reference analyses for Java and C++. Both flow-sensitive and context-sensitive
techniques were used in pointer analysis. In general, the analysis community decided
that flow sensitivity was not scalable to large programs. Context sensitivity for C pointer
analysis also was explored independent of flow sensitivity, but the verdict on
its effectiveness is less clear. Keeping calling contexts distinguished is of varying
importance in a C program, depending on programming style, whereas in object-oriented
codes it seems crucial for obtaining high precision for problems needing dependence
information, for example. In general, program representation in pointer analysis was on the
statement level, represented by an abstract syntax tree or flow graph. Solution
methodologies included constraint-based techniques and dataflow approaches that allowed both
context-sensitive and context-insensitive formulations.

Some reference analyses calculated finite sets of types (i.e., classes) for reference
variables that characterized the objects to which they may refer. The prototypical
problem for which these analyses were used is call graph construction (i.e., dynamic dispatch
resolution). More recently, reference analyses have been used for discovering redundant
synchronizations, escaping objects and side-effect analysis [1,5,6,49,41,36,33,26,23].
These client analyses require more precision than call graph construction and thus
provide interesting different applications for analysis comparison.

Recall that the dimensions of analysis precision include: flow sensitivity, context
sensitivity, program representation, object representation, field sensitivity, reference
representation and directionality. In the following discussions, each dimension is considered
and examples of reference analyses using specific choices for each dimension are cited.
The goal here is to better understand how these analyses differ, not to select a best
reference analysis.



3.1 Flow Sensitivity

An early example of a flow- and context-sensitive reference analysis was presented by
Chatterjee et al. [8]. This algorithm was designed as a backwards and forwards dataflow
propagation on the strongly connected component decomposition of the approximate 
calling structure of the program. Although the successful experiments performed on 
programs written in a subset of C++ showed excellent precision of the reference solution 
obtained, there were scalability problems with the approach. 

Whaley and Lam [48] and Diwan et al. designed techniques that perform a
flow-sensitive analysis within each method, allowing kills in cases where an assignment
is unambiguous. For example, an assignment p = q allows the algorithm to
reinitialize the set of objects to which p may point to only those objects to which
q may point; this is an example of a kill assignment. By contrast, the assignment
p.f = q is not a kill assignment because the object whose f field is being mutated is not
necessarily unique. This use of flow sensitivity has the potential of greater precision, but
this potential has not yet been demonstrated for a specific analysis application.
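The difference between a kill assignment and a field store can be made concrete with a small sketch. The Python code below is an illustration only and is not taken from any of the cited analyses; the statement encoding and the names flow_sensitive, flow_insensitive and PROGRAM are assumptions made for this example. The flow-sensitive variant handles straight-line code, overwriting the points-to set of a variable on p = q (a kill) while only adding to field sets on p.f = q. The flow-insensitive variant ignores statement order and accumulates everything until a fixed point is reached.

```python
def flow_sensitive(stmts):
    """Straight-line analysis with kills on variable assignments."""
    pts, heap = {}, {}
    for s in stmts:
        if s[0] == "new":            # ("new", p, site):  p = new ...
            pts[s[1]] = {s[2]}       # kill: p now refers only to this site
        elif s[0] == "copy":         # ("copy", p, q):    p = q
            pts[s[1]] = set(pts.get(s[2], ()))   # kill: old targets dropped
        else:                        # ("store", p, f, q): p.f = q
            _, p, f, q = s
            for obj in pts.get(p, ()):
                # No kill: earlier targets of obj.f are retained.
                heap.setdefault((obj, f), set()).update(pts.get(q, ()))
    return pts, heap


def flow_insensitive(stmts):
    """Order-independent analysis: accumulate until nothing changes."""
    pts, heap = {}, {}
    changed = True
    while changed:
        changed = False
        for s in stmts:
            if s[0] == "new":
                updates = [(pts.setdefault(s[1], set()), {s[2]})]
            elif s[0] == "copy":
                updates = [(pts.setdefault(s[1], set()),
                            set(pts.get(s[2], ())))]
            else:
                _, p, f, q = s
                updates = [(heap.setdefault((obj, f), set()),
                            set(pts.get(q, ())))
                           for obj in sorted(pts.get(p, ()))]
            for target, new in updates:
                if not new <= target:
                    target |= new
                    changed = True
    return pts, heap


PROGRAM = [
    ("new", "p", "o1"),
    ("new", "q", "o2"),
    ("copy", "p", "q"),        # kill assignment: p = q
    ("new", "r", "o3"),
    ("store", "p", "f", "r"),  # p.f = r
    ("new", "r", "o4"),
    ("store", "p", "f", "r"),  # p.f = r again, with a different r
]

fs_pts, fs_heap = flow_sensitive(PROGRAM)
fi_pts, fi_heap = flow_insensitive(PROGRAM)
print(fs_pts["p"], fi_pts["p"])
```

In this example the flow-sensitive result keeps p pointing only to o2, whereas the flow-insensitive result also retains o1 and therefore records spurious targets for the f field of o1.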

Given that object-oriented codes generally have small methods, the expected payoff 
of flow sensitivity on analysis precision would seem minimal. Concerns about scalability 
have resulted in many analyses abandoning the use of flow sensitivity, in favor of some 
form of context sensitivity. 



3.2 Context Sensitivity 

Classically, there are two approaches to embedding context sensitivity in an analysis,
using call strings and functions [38]. Call strings refer to using the top sequence on the
runtime call stack to distinguish the interprocedural context of dataflow information; the
idea is only to combine dataflow information tagged with consistent call strings (that is,
dataflow information that may exist co-temporally during execution). Work in control
flow analysis by Shivers, originally aimed at functional programming languages,
is related conceptually to the Sharir and Pnueli call string approach. These control flow
analyses are distinguished by the amount of calling context remembered; the analyses are
called k-CFA, where k indicates the length of the call string maintained. The functional
approach uses information about the state of computation at a call site to distinguish
different call sites. Some reference analyses that solely use inheritance hierarchy
information are context-insensitive; some later, more precise analyses (which incorporate
interprocedural flow through parameters) are also context-insensitive to ensure
scalability (according to their authors) [15,47,45,33,23,48].

Other reference analyses use both the call-string and functional notions of
classical context sensitivity [38]. Palsberg and Schwartzbach presented a 1-CFA reference
analysis. Plevyak and Chien [30] described an incremental approach to context
sensitivity, which allows them to refine an original analysis when more context is needed
to distinguish parts of a solution due to different call sites; their approach seems to
combine the call string and functional approaches in order to handle both polymorphic
functions and polymorphic containers. Agesen [1] sought to improve upon the Palsberg
and Schwartzbach algorithm by specifically adding a functional notion of context
sensitivity. In his Cartesian product algorithm, he defined different contexts using tuples of
parameter types that could access a method; these tuples were computed lazily and
memoized for possible sharing between call sites. Grove and Chambers also explored
the two notions of context sensitivity in different algorithms, using both call strings and
tuples of parameter type sets (analogous to Agesen's Cartesian product algorithm).
Milanova et al. defined object sensitivity [26], a functional approach that effectively allows
differentiation of method calls by distinct receiver object.
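The effect of the call string length k can be illustrated on a single identity method, id(x) { return x; }. The Python sketch below is a deliberately tiny model and not an implementation of any cited algorithm; the function analyze_identity, the call site labels and the data sets DIRECT and WRAPPED are hypothetical. Each call is described by the call string that reaches id, the objects passed in, and the variable receiving the result; the analysis keeps one points-to set for the formal parameter per context, where a context is the last k call sites.

```python
def analyze_identity(calls, k):
    """Analyze calls to `id(x) { return x; }` with call strings of length k.

    calls: list of (call_string, argument_objects, result_variable).
    Returns the points-to set inferred for each result variable.
    """
    formal = {}  # context -> objects that the formal x may refer to
    for call_string, arg_objects, _result in calls:
        ctx = tuple(call_string[-k:]) if k > 0 else ()
        formal.setdefault(ctx, set()).update(arg_objects)
    results = {}
    for call_string, _args, result in calls:
        ctx = tuple(call_string[-k:]) if k > 0 else ()
        results[result] = set(formal[ctx])
    return results


# Two direct calls from main: a = id(new A()); b = id(new B());
DIRECT = [
    (("main:3",), {"A1"}, "a"),
    (("main:4",), {"B1"}, "b"),
]

# The same calls routed through a wrapper method that calls id at wrap:1.
WRAPPED = [
    (("main:3", "wrap:1"), {"A1"}, "a"),
    (("main:4", "wrap:1"), {"B1"}, "b"),
]

print(analyze_identity(DIRECT, 0))
print(analyze_identity(DIRECT, 1))
print(analyze_identity(WRAPPED, 1))
print(analyze_identity(WRAPPED, 2))
```

With k = 0 the two calls are merged, so both results may refer to either object. With k = 1 the direct calls are separated, but the wrapped calls are merged again because they share the innermost call site; k = 2 is needed to separate them, which is the sense in which k-CFA analyses differ by the amount of calling context remembered.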

This active experimentation with context sensitivity demonstrates its perceived im- 
portance in the analysis community as enabling a more precise analysis. The prevalence 
of method calls in an object-oriented program leads to the expectation that more precise 
analyses for object-oriented languages can be obtained by picking the ’right’ practical 
embedding of context sensitivity. 



3.3 Program Representation (i.e., Calling Structure) 

Early reference analyses [15,14,4] were used to provide a static call graph that initialized
computation for a subsequent, more precise reference analysis [29,8,23,45]. Other
analyses constructed the call graph lazily, as new call edges became known due to discovery
of a new object being referred to [27,31,48,33,26,23]. Grove and Chambers discuss the
relative merits of both approaches and conclude that the lazy construction approach is
preferred.

Clearly, the trend is to use the lazy construction approach so that the analysis includes 
a reachability calculation for further accuracy. This can be especially significant when 
programs are built using libraries; often only a few methods from a library are actually 
accessed and excluding unused methods can significantly affect analysis cost as well as 
precision. 
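The following Python sketch contrasts the two ways of obtaining the calling structure on a toy program. It is an assumption-laden illustration rather than a rendering of any cited analysis: the tables HIERARCHY and METHODS, the class names, and the functions dispatch, lazy_call_graph and name_based_call_graph are all hypothetical, receivers are local variables assigned by new in the same method, and there is no parameter passing. The a priori approximation used here resolves each call by method name alone; the lazy construction starts from the entry method and only adds methods that the receiver classes found so far can actually reach.

```python
# Toy class hierarchy: class -> superclass.
HIERARCHY = {"Main": None, "Shape": None, "Circle": "Shape",
             "Square": "Shape", "Helper": None, "Unused": None}

# Method bodies: ("new", var, cls) or ("call", receiver_var, method_name).
METHODS = {
    "Main.main":   [("new", "s", "Circle"), ("call", "s", "draw")],
    "Shape.draw":  [],
    "Circle.draw": [("new", "h", "Helper"), ("call", "h", "log")],
    "Square.draw": [("new", "u", "Unused"), ("call", "u", "run")],
    "Helper.log":  [],
    "Unused.run":  [],
}


def dispatch(cls, name):
    """Find the method that a receiver of class cls runs for `name`."""
    while cls is not None:
        if f"{cls}.{name}" in METHODS:
            return f"{cls}.{name}"
        cls = HIERARCHY.get(cls)
    return None


def lazy_call_graph(entry):
    """Build the call graph on the fly, starting from the entry method."""
    reachable, edges, worklist = {entry}, set(), [entry]
    while worklist:
        method = worklist.pop()
        classes = {}
        for kind, var, arg in METHODS[method]:
            if kind == "new":
                classes.setdefault(var, set()).add(arg)
        for kind, var, arg in METHODS[method]:
            if kind != "call":
                continue
            for cls in sorted(classes.get(var, ())):
                target = dispatch(cls, arg)
                if target is None:
                    continue
                edges.add((method, target))
                if target not in reachable:
                    reachable.add(target)
                    worklist.append(target)
    return reachable, edges


def name_based_call_graph():
    """A priori approximation: a call may reach any method of that name."""
    edges = set()
    for method, body in METHODS.items():
        for kind, _var, arg in body:
            if kind == "call":
                for candidate in METHODS:
                    if candidate.split(".")[1] == arg:
                        edges.add((method, candidate))
    return edges


reachable, lazy_edges = lazy_call_graph("Main.main")
print(sorted(reachable))
print(len(lazy_edges), len(name_based_call_graph()))
```

In this toy program the lazy construction never visits Square.draw or Unused.run, which mirrors the point made above about unused library methods: excluding them reduces the work of the analysis and removes edges that would otherwise dilute the solution.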



3.4 Object Representation 



Representation choices in analyses often are directly related to issues of precision. There
are two common choices for reference analysis. First, an analysis can use one abstract
object per class to represent all possible instantiations of that class. Second, objects can
be identified by their creation site; in this case, all objects created by a specific new
statement are represented by the same abstract object. Usually the reason for selecting
the first representation over the second is efficiency, even though it clearly leads to less
precise solutions.
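A short sketch shows how the two naming schemes differ in practice. The Python code below is illustrative only; the function names abstract_name, points_to and may_alias, the site labels, and the example program are assumptions made for this example. Two variables are each assigned a fresh ArrayList at different creation sites, and the question is whether the analysis concludes that they may refer to the same object.

```python
def abstract_name(alloc, scheme):
    """Name an allocation either by its class or by its creation site."""
    site, cls = alloc
    return cls if scheme == "class" else f"{cls}@{site}"


def points_to(assignments, scheme):
    """assignments: list of (variable, (creation_site, class))."""
    pts = {}
    for var, alloc in assignments:
        pts.setdefault(var, set()).add(abstract_name(alloc, scheme))
    return pts


def may_alias(pts, x, y):
    # Two references may alias if they share an abstract object.
    return bool(pts[x] & pts[y])


# a = new ArrayList();   (site L10)
# b = new ArrayList();   (site L20)
PROGRAM = [("a", ("L10", "ArrayList")), ("b", ("L20", "ArrayList"))]

by_class = points_to(PROGRAM, "class")
by_site = points_to(PROGRAM, "site")
print(may_alias(by_class, "a", "b"), may_alias(by_site, "a", "b"))
```

Both schemes report the same class for a and b, which is all that dynamic dispatch resolution needs, but only the creation-site scheme can show that a and b never refer to the same object.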



The early reference analyses all used one abstract object per class. Some
later reference analyses made this same choice [45,47], citing reasons of scalability. Other
analyses used creation sites to identify equivalence classes of objects, each corresponding
to one representative object in the reference analysis solution [33,18,23,48]. There are
other, more precise object naming schemes which establish finer-grained equivalence
classes for objects [18,26,27,31,24].

While the use of one abstract object per class may suffice for call graph construction,
for richer semantic analyses (e.g., side effect, def-use and escape analyses) the use of a
representative for each object creation site is preferable.



3.5 Field Sensitivity

Another representation issue is whether or not to preserve information associated with 
distinct reference fields in an object. One study indicated that not distinguishing 
object fields may result in imprecision and increased analysis cost. The majority of 
analyses which use representative objects also distinguish fields because of this precision 
improvement. 

It is interesting that Liang et al. reported that there appeared to be little difference
in precision when fields were used either with an abstract object per class or a
representative object per creation site with inclusion constraints; more experimentation is needed
to understand more fully the separate effects of each of the dimensions involved in these
experiments.
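The basic distinction between field-sensitive and field-insensitive solutions can be sketched as follows. This Python fragment is a minimal illustration under assumed names (analyze_fields, STORES, LOADS and the object labels are hypothetical) and does not reproduce the experiments discussed above. A field-sensitive analysis keeps one set per (object, field) pair, whereas a field-insensitive analysis keeps a single set per object.

```python
def analyze_fields(stores, loads, field_sensitive):
    """stores: (object, field, stored_objects); loads: (var, object, field)."""
    heap = {}
    for obj, field, values in stores:
        key = (obj, field) if field_sensitive else (obj, "*")
        heap.setdefault(key, set()).update(values)
    result = {}
    for var, obj, field in loads:
        key = (obj, field) if field_sensitive else (obj, "*")
        result[var] = set(heap.get(key, ()))
    return result


# o.name = s1;  o.owner = p1;  x = o.name;
STORES = [("o", "name", {"s1"}), ("o", "owner", {"p1"})]
LOADS = [("x", "o", "name")]

print(analyze_fields(STORES, LOADS, field_sensitive=True))
print(analyze_fields(STORES, LOADS, field_sensitive=False))
```

With fields distinguished, x may refer only to s1; without them, the object stored in the owner field is also reported for x.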

3.6 Reference Representation 

This dimension concerns whether or not each reference is represented by a unique
representative throughout the entire program. For most reference analyses, this is the case.
Sometimes, all the references of the same type are represented by one abstract reference
of that type. Alternatively, there can be one abstract reference per method. These
two alternatives reduce the number of references in the solution, so that the analysis is
more efficient.

Tip and Palsberg [47] explored many dimensions of reference representation. Several
analyses were defined whose precision lay between RTA and 0-CFA [39,18]. They
experimented with abstract objects without fields and a unique reference representation
(i.e., CTA analysis), abstract objects with fields and a unique reference representation
(i.e., MTA analysis), abstract objects and one abstract reference per method (i.e., FTA
analysis), and abstract objects with fields with one abstract reference per method (i.e.,
XTA analysis). The XTA analysis resulted in the best performance and precision tradeoff
for call graph construction, their target application.

The VTA analysis [45] of the SABLE research project at McGill University
specifically contrasted the use of unique reference representatives versus the use of one abstract
reference representative per class. The latter was found to be too imprecise to be of use.

3.7 Directionality 

Reference analysis usually is formulated as constraints that describe the sets of objects 
to which a reference can point and how these sets are mutated by the semantics of var- 
ious assignments to (and through) reference variables and fields. There is a significant 
precision difference between symmetric and directional reference analyses, which are 
formulated as unification constraints or inclusion constraints, respectively. The
unification constraints are similar to those used in Steensgaard's pointer analysis for C; the
inclusion constraints are similar to those used by Andersen's pointer analysis for C.

Precision differences between these constraint formulations for C pointer analysis
were explained by Shapiro and Horwitz. Considering the pointer assignment
statement p = q, the unification analysis will union the points-to set of p with the points-to
set of q, effectively saying both pointer variables can point to the same set of objects after
this assignment; this union is accomplished recursively, so that if *p is also a pointer
then its set is unioned with that of *q. An inclusion analysis will conclude after this same
assignment statement that the points-to set of p includes the points-to set of q,
maintaining the direction of the assignment. Similar arguments can show why inclusion
constraints can be expected to yield a more precise reference analysis solution than
unification constraints, as was shown by Liang et al. [27]. Ruf developed a context-sensitive
analysis based on unification constraints as part of a redundant synchronization removal
algorithm [36].

Solution procedures for both types of constraints are polynomial time (in the size
of the constraint set), but unification constraints can be solved in almost linear worst
case cost, whereas inclusion constraints have cubic worst case cost. Although these
worst case costs are not necessarily experienced in practice, this difference was
considered significant until recently, when newer techniques showed that inclusion
constraints in reference analysis can be solved effectively in practice [16,44,20,33,48].
Thus, it seems that the increased precision of inclusion constraints is worth the
possible additional cost, but this may depend on the accuracy needs of the specific
analysis application.
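The precision gap between the two formulations can be seen on a program with only allocations and copy assignments. The Python sketch below is a simplified illustration, not Steensgaard's or Andersen's algorithm: the functions inclusion and unification and the example data are hypothetical, fields and dereferences are left out, and the unification variant simply places both sides of an assignment into one equivalence class that shares a single points-to set, using a union-find structure.

```python
def inclusion(allocs, copies):
    """Directional analysis: for lhs = rhs, pts(rhs) is a subset of pts(lhs)."""
    pts = {}
    for var, obj in allocs:
        pts.setdefault(var, set()).add(obj)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in copies:
            new = pts.get(rhs, set()) - pts.setdefault(lhs, set())
            if new:
                pts[lhs] |= new
                changed = True
    return pts


def unification(allocs, copies):
    """Symmetric analysis: lhs = rhs merges both sides into one class."""
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    for lhs, rhs in copies:
        parent[find(lhs)] = find(rhs)
    shared = {}
    for var, obj in allocs:
        shared.setdefault(find(var), set()).add(obj)
    return {v: set(shared.get(find(v), set())) for v in list(parent)}


# p = new A();  q = new B();  r = p;  r = q;
ALLOCS = [("p", "o1"), ("q", "o2")]
COPIES = [("r", "p"), ("r", "q")]

print(inclusion(ALLOCS, COPIES))
print(unification(ALLOCS, COPIES))
```

Under inclusion constraints only r may refer to both objects, while p and q keep their singleton sets. Under unification, the two assignments to r force p, q and r into one class, so each of them is reported as possibly referring to both objects. The union-find structure is what makes the symmetric variant cheap to solve.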



4 Open Issues 

There still are open issues in the analysis of object-oriented languages for which solutions 
must be found. Some of them are listed below. 

- Reflection. Programs with reflection constructs can create objects, generate method 
calls, and access fields of objects at runtime whose declared types cannot be known at 
compile-time. This creates problems for analyses, because the program is effectively 
incomplete at compile-time. Most analyses transform a program to account for the 
effects of reflection before analyzing the program empirically. 

- Native methods. Calls to native methods (i.e., methods not written in the object- 
oriented language, often written in C) may have dataflow consequences that must 
be taken into account by a safe analysis. 

- Exceptions. In Java programs checked exceptions appear explicitly and unchecked
exceptions appear implicitly; both can affect flow of control. Since obtaining a good
approximation to possible program control flow is a requirement for a precise
analysis, some approaches have been tried [40,7,9,10], but this is still an open problem.

- Dynamic class loading. Dynamic class loading may invalidate the dynamic
dispatch function previously calculated by a reference analysis [42]. This suggests
the possibility of designing an incremental reference analysis; however, it will be
difficult to determine the previously-derived information that has been invalidated.

- Incomplete programs. Often object-oriented programs are either libraries or library
clients, and thus partial programs. Analysis of such codes has been addressed [47,46,34],
but more work is needed. Having a good model for partial program analysis for
object-oriented languages may allow analyses to be developed for component-based
programs; it is likely, however, that some reliance on component-provider-based
information may be necessary.

- Benchmarks. It is very important to use benchmark suites in testing analyses,
because reproducibility is required for strong empirical validation. Some researchers
have used the SPEC compiler benchmarks or have shared collected benchmark
programs.

^ A combination of these constraints was used for C pointer analysis by Das and showed good
precision in empirical experiments for practical cost.



5 Conclusions 

Having presented an overview of the dimensions of precision in reference analysis of 
object-oriented languages, the current challenge in analysis research is to match the right 
analyses to specific client applications, with appropriate cost and precision. This task is 
aided by a clear understanding of the role of each dimension in the effectiveness of the 
resulting analysis solution. 

The nature of object-oriented languages is that programs are constructed from many 
small methods and that method calls (with possible recursion) are the primary control 
flow structure used. Thus, it is critical to include some type of context sensitivity in 
an analysis, to obtain sufficient precision for tasks beyond simple dynamic dispatch. 
Arguably, the functional approach offers a more practical mechanism than the call-string
approach embodied in k-CFA analyses and it seems to be more cost effective. It
is also clear that a solution procedure using inclusion constraints can be practical and 
delivers increased precision over cheaper unification constraint resolution. 

These opinions are held after experimentation by the community with many dimensions
of analysis precision. However, no one analysis fits every application, and many of
the analyses discussed will be applicable to specific problems because their precision is
sufficient to do the job. A remaining open question is: Can the analysis community deliver
useful analyses for a problem at practical cost? The answer is yet to be determined.

Abstract. Polyglot is an extensible compiler framework that supports the easy
creation of compilers for languages similar to Java, while avoiding code duplication.
The Polyglot framework is useful for domain-specific languages, exploration
of language design, and for simplified versions of Java for pedagogical use. We
have used Polyglot to implement several major and minor modifications to Java;
the cost of implementing language extensions scales well with the degree to which
the language differs from Java. This paper focuses on the design choices in Polyglot
that are important for making the framework usable and highly extensible.
Polyglot source code is available.



1 Introduction 

Domain-specific extension or modification of an existing programming language enables
more concise, maintainable programs. However, programmers construct domain-specific
language extensions infrequently because building and maintaining a compiler
is onerous. Better technology is needed. This paper presents a methodology for the
construction of extensible compilers and also an application of this methodology in our
implementation of Polyglot, a compiler framework for creating extensions to Java.
Language extension or modification is useful for many reasons: 

- Security. Systems that enforce security at the language level may find it useful to 
add security annotations or rule out unsafe language constructs. 

- Static checking. A language might be extended to support annotations necessary
for static verification of program correctness [23], more powerful static checking of
program invariants, or heuristic methods.

- Language design. Implementation helps validate programming language designs. 

- Optimization. New passes may be added to implement optimizations not performed 
by the base compiler or not permitted by the base language specification. 

- Style. Some language features or idioms may be deemed to violate good style but 
may not be easy to detect with simple syntactic analysis. 

- Teaching. Students may learn better using a language that does not expose them to
difficult features (e.g., inner classes) or confusing error messages.

We refer to the original unmodified language as the base language; we call the modified
language a language extension even if it is not backwards compatible.

When developing a compiler for a language extension, it is clearly desirable to build 
upon an existing compiler for the base language. The simplest approach is to copy the 
source code of the base compiler and edit it in place. This may be fairly effective if the 
base compiler is carefully written, but it duplicates code. Changes to the base compiler — 
perhaps to fix bugs — may then be difficult to apply to the extended compiler. Without 
considerable discipline, the code of the two compilers diverges, leading to duplication 
of effort. 

Our approach is different: the Polyglot framework implements an extensible compiler 
for the base language Java 1.4. This framework, also written in Java, is by default simply
a semantic checker for Java. However, a programmer implementing a language extension 
may extend the framework to define any necessary changes to the compilation process, 
including the abstract syntax tree (AST) and semantic analysis. 

An important goal for Polyglot is scalable extensibility: an extension should require
programming effort proportional only to the magnitude of the difference between the
extended and base languages. Adding new AST node types or new compiler passes should
require writing code whose size is proportional to the change. Language extensions often
require uniformly adding new fields and methods to an AST node and its subclasses;
we require that this uniform mixin extension be implementable without subclassing all
the extended node classes. Scalable extensibility is a challenge because it is difficult
to simultaneously extend both types and the procedures that manipulate them [30, 38].
Existing programming methodologies such as visitors improve extensibility but are
not a complete solution. In this paper we present a methodology that supports extension
of both compiler passes and AST nodes, including mixin extension. The methodology
uses abstract factories, delegation, and proxies to permit greater extensibility and
code reuse than in previous extensible compiler designs.

Polyglot has been used to implement more than a dozen Java language extensions of
varying complexity. Our experience using Polyglot suggests that it is a useful framework
for developing compilers for new Java-like languages. Some of the complex extensions
implemented are Jif [25], which extends Java with security types that regulate information
flow; PolyJ, which adds bounded parametric polymorphism to Java; and
JMatch, which extends Java with pattern matching and iteration features. Compilers
built using Polyglot are themselves extensible; complex extensions such as Jif and
PolyJ have themselves been extended. The framework is not difficult to learn: users have
been able to build interesting extensions to Java within a day of starting to use Polyglot.
The Polyglot source code is available at http://www.cs.cornell.edu/Projects/polyglot.

The rest of the paper is structured as follows. Section 2 gives an overview of the
Polyglot compiler. Section 3 describes in detail our methodology for providing scalable
extensibility. Other Polyglot features that make writing an extensible compiler convenient
are described in Section 4. Our experience using the Polyglot system to build
various languages is reported in Section 5. Related work on extensible compilers and
macro systems is discussed in Section 6, and we conclude in Section 7.

Fig. 1. Polyglot Architecture



2 Polyglot Overview 

This section presents an overview of the various components of Polyglot and describes 
how they can be extended to implement a language extension. An example of a small 
extension is given to illustrate this process. 

2.1 Architecture 

A Polyglot extension is a source-to-source compiler that accepts a program written in
a language extension and translates it to Java source code. It also may invoke a Java
compiler such as javac to convert its output to bytecode.

The compilation process offers several opportunities for the language extension implementor
to customize the behavior of the framework. This process, including the eventual
compilation to Java bytecode, is shown in Fig. 1. In the figure, the name Ext stands
for the particular extended language.

The first step in compilation is parsing input source code to produce an AST. Polyglot 
includes an extensible parser generator, PPG, that allows the implementor to define the 
syntax of the language extension as a set of changes to the base grammar for Java. 
PPG provides grammar inheritance [22], which can be used to add, modify, or remove 
productions and symbols of the base grammar. PPG is implemented as a preprocessor 
for the CUP LALR parser generator.

The extended AST may contain new kinds of nodes either to represent syntax added 
to the base language or to record new information in the AST. These new node types are 
added by implementing the Node interface and optionally subclassing from an existing 
node implementation. 

The core of the compilation process is a series of compilation passes applied to the 
abstract syntax tree. Both semantic analysis and translation to Java may comprise several 
such passes. The pass scheduler selects passes to run over the AST of a single source file, 
in an order defined by the extension, ensuring that dependencies between source files are 
not violated. Each compilation pass, if successful, rewrites the AST, producing a new 
AST that is the input to the next pass. Some analysis passes (e.g., type checking) may halt 
compilation and report errors instead of rewriting the AST. A language extension may 
modify the base language pass schedule by adding, replacing, reordering, or removing 
compiler passes. The rewriting process is entirely functional; compilation passes do 
not destructively modify the AST. More details on our methodology are described in
Section 3.
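As a rough illustration of what adding, replacing, and removing passes could look like, the following Java sketch models a pass schedule as an ordered list of names. It is not Polyglot's scheduler: the class `PassSchedule`, its methods, and the pass names are all hypothetical, and dependencies between source files are ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of a pass schedule as an ordered list of pass names.
class PassSchedule {
    private final List<String> passes;

    PassSchedule(String... basePasses) {
        this.passes = new ArrayList<>(Arrays.asList(basePasses));
    }

    // Insert a new pass immediately after an existing one.
    void addAfter(String existing, String added) {
        passes.add(indexOf(existing) + 1, added);
    }

    // Substitute one pass for another in the same position.
    void replace(String existing, String replacement) {
        passes.set(indexOf(existing), replacement);
    }

    // Drop a pass from the schedule.
    void remove(String existing) {
        passes.remove(indexOf(existing));
    }

    // Snapshot of the current order; callers cannot mutate the schedule through it.
    List<String> order() {
        return new ArrayList<>(passes);
    }

    private int indexOf(String name) {
        int i = passes.indexOf(name);
        if (i < 0) {
            throw new IllegalArgumentException("no such pass: " + name);
        }
        return i;
    }

    public static void main(String[] args) {
        PassSchedule s = new PassSchedule("parse", "type-check", "translate");
        s.addAfter("type-check", "key-check");   // an extension adds a pass
        s.replace("translate", "translate-ext"); // and replaces another
        System.out.println(s.order());
    }
}
```

An extension in this model starts from the base schedule and edits it, so its code grows with the number of passes it changes rather than with the length of the schedule.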

Compilation passes do their work using objects that define important characteristics
of the source and target languages. A type system object acts as a factory for objects
representing types and related constructs such as method signatures. The type system
object also provides some type checking functionality. A node factory constructs AST
nodes for its extension. In extensions that rely on an intermediate language, multiple
type systems and node factories may be used during compilation.

1 tracked(F) class FileReader {
2     FileReader(File f) [] -> [F] throws IOException [] { ... }
3     int read() [F] -> [F] throws IOException [F] { ... }
4     void close() [F] -> [] { ...; free this; }
5 }

Fig. 2. Example Coffer FileReader

After all compilation passes complete, the usual result is a Java AST. A Java compiler
such as javac is invoked to compile the Java code to bytecode. The bytecode may contain
serialized extension-specific type information used to enable separate compilation; we
discuss separate compilation in more detail in Section 4.

2.2 An Example: Coffer 

To motivate our design, we describe a simple extension of Java that supports some of the 
resource management facilities of the Vault language. This language, called Coffer,
is a challenge for extensible compilers because it makes substantial changes to both the 
syntax and semantics of Java and requires identical modifications to many AST node 
types. Coffer allows a linear capability, or key, to be associated with an object. Methods 
of the object may be invoked only when the key is held. A key is allocated when its 
object is created and deallocated by a free statement in a method of the object. The 
Coffer type system regulates allocation and freeing of keys to guarantee statically that 
keys are always deallocated. 

Fig. 2 shows a small Coffer program declaring a FileReader class that guarantees
the program cannot read from a closed reader. The annotation tracked(F) on line 1
associates a key named F with instances of FileReader. Pre- and post-conditions on 
method and constructor signatures, written in brackets, specify how the set of held keys 
changes through an invocation. For example on line 2, the precondition [] indicates that 
no key need be held to invoke the constructor, and the postcondition [F] specifies that F 
is held when the constructor returns normally. The close method (line 4) frees the key; 
no subsequent method that requires F can be invoked. 
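The way the set of held keys evolves across the calls in Fig. 2 can be mimicked at run time with a small Java sketch. This is only a simulation: Coffer performs the check statically, and the class `KeyTracker` and its methods are hypothetical names that do not come from Coffer or Polyglot.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration: tracks the set of held keys as calls are made.
// Coffer checks this statically; this sketch only simulates the bookkeeping.
class KeyTracker {
    private final Set<String> held = new HashSet<>();

    // Apply a call with precondition `pre` and postcondition `post`.
    // Returns false (and changes nothing) if a required key is not held.
    boolean call(Set<String> pre, Set<String> post) {
        if (!held.containsAll(pre)) {
            return false;
        }
        held.removeAll(pre);
        held.addAll(post);
        return true;
    }

    // Copy of the keys currently held.
    Set<String> heldKeys() {
        return new HashSet<>(held);
    }

    public static void main(String[] args) {
        Set<String> none = Collections.emptySet();
        Set<String> f = Collections.singleton("F");
        KeyTracker t = new KeyTracker();
        System.out.println(t.call(none, f)); // constructor: [] -> [F]
        System.out.println(t.call(f, f));    // read():      [F] -> [F]
        System.out.println(t.call(f, none)); // close():     [F] -> []
        System.out.println(t.call(f, f));    // read() after close(): rejected
    }
}
```

The first three calls succeed and the last one fails, which corresponds to the compile-time error Coffer would report for reading from a closed reader.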

The Coffer extension is used as an example throughout the next section. It is implemented
by adding new compiler passes for computing and checking held key sets
at each program point. Coffer’s free statements and additional type annotations are 
implemented by adding new AST nodes and extending existing nodes and passes. 

3 A Methodology for Scalable Extensibility 

Our goal is a mechanism that supports scalable extension of both the syntax and semantics
of the base language. The programmer effort required to add or extend a pass should be
proportional to the number of AST nodes non-trivially affected by that pass; the effort
required to add or extend a node should be proportional to the number of passes the node 
must implement in an interesting way. When extending or overriding the behavior of 
existing AST nodes, it is often necessary to extend a node class that has more than one 
subclass. For instance, the Coffer extension adds identical pre- and post-condition syntax 
to both methods and constructors; to avoid code duplication, these annotations should 
be added to the common base class of method and constructor nodes. The programmer 
effort to make such changes should be constant, irrespective of the number of subclasses 
of this base class. Inheritance is the appropriate mechanism for adding a new field or 
method to a single class. However, adding the same member to many different classes 
can quickly become tedious. This is true even in languages with multiple inheritance: a 
new subclass must be created for every class affected by the change. Modifying these 
subclasses later requires making identical changes to each subclass. Mixin extensibility 
is a key goal of our methodology: a change that affects multiple classes should require 
no code duplication. 

Compilers written in object-oriented languages often implement compiler passes 
using the Visitor design pattern. However, visitors present several problems for
scalable extensibility. In a non-extensible compiler, the set of AST nodes is usually 
fixed. The Visitor pattern permits scalable addition of new passes, but sacrifices scalable 
addition of AST node types. To allow specialization of visitor behavior for both the 
AST node type and the visitor itself, each visitor class implements a separate callback 
method for every node type. Thus, adding a new kind of AST node requires modifying 
all existing visitors to insert a callback method for the node. Visitors written without 
knowledge of the new node cannot be used with the new node because they do not 
implement the callback. The Visitor pattern also does not provide mixin extensibility. A 
separate mechanism is needed to address this problem. 

An alternative to the Visitor pattern is for each AST node class to implement a method 
for each compiler pass. However, this technique suffers from the dual problem: adding 
a new pass requires adding a method to all existing node types. 
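The trade-off can be seen in a minimal Java sketch of the Visitor pattern. The node and visitor names below are hypothetical and are not Polyglot classes; the sketch only shows why new passes are cheap while new node kinds are not.

```java
// Hypothetical AST with two node kinds and a classic visitor.
interface Visitor<R> {
    R visitLit(Lit n);
    R visitAdd(Add n);
    // Adding a new node kind (say, Neg) forces a new callback here,
    // and therefore an edit to every existing Visitor implementation.
}

interface Expr {
    <R> R accept(Visitor<R> v);
}

class Lit implements Expr {
    final int value;
    Lit(int value) { this.value = value; }
    public <R> R accept(Visitor<R> v) { return v.visitLit(this); }
}

class Add implements Expr {
    final Expr left, right;
    Add(Expr left, Expr right) { this.left = left; this.right = right; }
    public <R> R accept(Visitor<R> v) { return v.visitAdd(this); }
}

// A new pass is cheap: one new class, no changes to the node classes.
class Eval implements Visitor<Integer> {
    public Integer visitLit(Lit n) { return n.value; }
    public Integer visitAdd(Add n) { return n.left.accept(this) + n.right.accept(this); }
}

// A second pass, again added without touching Lit or Add.
class Show implements Visitor<String> {
    public String visitLit(Lit n) { return Integer.toString(n.value); }
    public String visitAdd(Add n) {
        return "(" + n.left.accept(this) + " + " + n.right.accept(this) + ")";
    }
}

class VisitorDemo {
    public static void main(String[] args) {
        Expr e = new Add(new Lit(1), new Add(new Lit(2), new Lit(3)));
        System.out.println(e.accept(new Show()) + " = " + e.accept(new Eval()));
    }
}
```

Moving `eval` and `show` into the node classes as methods would reverse the situation: a new node kind would need one new class, but a new pass would need a new method in every node class.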

The remainder of this section presents a mechanism that achieves the goal of scalable 
extensibility. We first describe our approach to providing mixin extensibility. We then 
show how our solution also addresses the other aspects of scalable extensibility. 

3.1 Node Extension Objects and Delegates 

We implement passes as methods associated with AST node objects; however, to provide
scalable extensibility, we introduce a delegation mechanism, illustrated in Fig. 3, that
enables orthogonal extension and method override of nodes.

Since subclassing of node classes does not adequately address orthogonal extension
of methods in classes with multiple subclasses, we add to each node object a field, labeled
ext in Fig. 3, that points to a (possibly null) node extension object. The extension
object (CofferExt in the figure) provides implementations of new methods and fields,
thus extending the node interface without subclassing. These members are accessed
by following the ext pointer and casting to the extension object type. In the example,
CofferExt extends Node with keyFlow() and checkKeys() methods. Each AST
node class to be extended with a given implementation of these members uses the
same extension object class. Thus, several node classes can be orthogonally extended
with a single implementation, avoiding code duplication. Since language extensions can
themselves be extended, each extension object has an ext field similar to the one located
in the node object. In effect, a node and its extension object together can be considered
a single node.

Fig. 3. Delegates and extensions. [Diagram not reproduced. Its recoverable labels show a node
with typeCheck() and print() methods, a NodeDelegate whose print() calls node.print(), and a
possible extension of the Coffer node.]

Extension objects alone, however, do not adequately handle method override when 
the base language is extended multiple times. The problem is that any one of a node’s 
extension objects can implement the overridden method; a mechanism is needed to 
invoke the correct implementation. A possible solution to this problem is to introduce a 
delegate object for each method in the node interface. For each method, a field in the node 
points to an object implementing that method. Calls to the method are made through its 
delegate object; language extensions can override the method simply by replacing the 
delegate. The delegate may implement the method itself or may invoke methods in the 
node or in the node’s extension objects. 

Because maintaining one object per method is cumbersome, the solution used in
Polyglot is to combine delegate objects and to introduce a single delegate field for each
node object, illustrated by the del field in Fig. 3. This field points to an object implementing
the entire Node interface, by default the node itself. To override a method, a
language extension writer creates a new delegate object containing the new implementation
or code to dispatch to the new implementation. The delegate implements Node’s
other methods by dispatching back to the node. Extension objects also contain a del
field used to override methods declared in the extension object interface.
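The arrangement of ext and del fields can be sketched in a few lines of Java. The field names `ext` and `del` and the methods `typeCheck` and `checkKeys` follow the text; everything else (`NodeOps`, `StricterDelegate`, `DelegateDemo`, the string return values) is hypothetical, and the real Polyglot interfaces are considerably richer. For brevity the `ext` field is typed as `CofferExt` here, so no cast is needed.

```java
// Simplified sketch; the real Polyglot interfaces are richer than this.
interface NodeOps {
    String typeCheck();
}

class Node implements NodeOps {
    CofferExt ext;          // extension object: adds new members to the node
    NodeOps del = this;     // delegate: by default the node itself

    public String typeCheck() { return "base typeCheck"; }
}

class CofferExt {
    Node node;              // back pointer to the extended node
    CofferExt del = this;   // delegate for methods declared in the extension

    String checkKeys() { return "default checkKeys"; }
}

// A further extension overrides typeCheck by replacing the node's delegate.
class StricterDelegate implements NodeOps {
    final Node node;
    StricterDelegate(Node node) { this.node = node; }
    public String typeCheck() { return "stricter, then " + node.typeCheck(); }
}

class DelegateDemo {
    // Builds a node with its extension object attached.
    static Node makeNode() {
        Node n = new Node();
        n.ext = new CofferExt();
        n.ext.node = n;
        return n;
    }

    public static void main(String[] args) {
        Node n = makeNode();
        System.out.println(n.del.typeCheck());       // base typeCheck
        System.out.println(n.ext.del.checkKeys());   // default checkKeys
        n.del = new StricterDelegate(n);             // override without subclassing Node
        System.out.println(n.del.typeCheck());       // stricter, then base typeCheck
    }
}
```

Because callers always go through `del`, swapping the delegate changes behavior for every caller without creating a subclass of `Node`.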

Calls to all node methods are made through the del pointer, thus ensuring that the
correct implementation of the method is invoked if the delegate object is replaced by
a language extension. Thus, in our example, the node’s typeCheck method is invoked
via n.del.typeCheck(); the Coffer checkKeys method is invoked by following the
node’s ext pointer and invoking through the extension object’s delegate:
((CofferExt) n.ext).del.checkKeys(). An extension of Coffer could replace the extension
object’s delegate to override methods declared in the extension, or it could replace the
node’s delegate to override methods of the node. To access Coffer’s type-checking
functionality, this new node delegate may be a subclass of Coffer’s node delegate class or
may contain a pointer to the old delegate object. The overhead of indirecting through
the del pointer accounts for less than 2% of the total compilation time.

3.2 AST Rewriters

Most passes in Polyglot are structured as functional AST rewriting passes. Factoring out 
AST traversal code eliminates the need to duplicate this code when implementing new 
passes. Each pass implements an AST rewriter object to traverse the AST and invoke 
the pass’s method at each node. At each node, the rewriter invokes a visitChildren 
method to recursively rewrite the node’s children using the rewriter and to reconstruct 
the node if any of the children are modified. A key implementation detail is that when a 
node is reconstructed, the node is cloned and the clone is returned. Cloning ensures that 
class members added by language extensions are correctly copied into the new node. 
The node’s delegates and extensions are cloned with the node. 

Each rewriter implements enter and leave methods, both of which take a node 
as argument. The enter method is invoked before the rewriter recurses on the node’s 
children using visitChildren and may return a new rewriter to be used for rewriting 
the children. This provides a convenient means for maintaining symbol table information 
as the rewriter crosses lexical scopes; the programmer need not write code to explicitly 
manage the stack of scopes, eliminating a potential source of errors. The leave method 
is called after visiting the children and returns the rewritten AST rooted at the node. 
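A functional rewriter of this shape might be sketched as follows. The method names `enter`, `leave`, and `visitChildren` follow the text, but `Tree`, `TreeRewriter`, and `Rename` are hypothetical stand-ins, not Polyglot's Node and rewriter classes, and delegates and extension objects are omitted.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical tree node that is never modified in place; not a Polyglot class.
class Tree implements Cloneable {
    final String label;
    private List<Tree> children;

    Tree(String label, List<Tree> children) {
        this.label = label;
        this.children = children;
    }

    List<Tree> children() { return children; }

    // Rewrite the children; clone this node only if some child changed.
    Tree visitChildren(TreeRewriter r) {
        List<Tree> rewritten = new ArrayList<>();
        boolean changed = false;
        for (Tree c : children) {
            Tree c2 = r.rewrite(c);
            changed |= (c2 != c);
            rewritten.add(c2);
        }
        if (!changed) return this;
        try {
            Tree copy = (Tree) super.clone(); // also copies fields added by subclasses
            copy.children = rewritten;
            return copy;
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e);
        }
    }
}

class TreeRewriter {
    // Called before the children are visited; may return a different rewriter.
    TreeRewriter enter(Tree n) { return this; }

    // Called after the children are visited; returns the rewritten node.
    Tree leave(Tree n) { return n; }

    final Tree rewrite(Tree n) {
        TreeRewriter inner = enter(n);
        Tree rebuilt = n.visitChildren(inner);
        return leave(rebuilt);
    }
}

// Example pass: replace every node labeled "x" by a node labeled "y".
class Rename extends TreeRewriter {
    @Override
    Tree leave(Tree n) {
        return n.label.equals("x") ? new Tree("y", n.children()) : n;
    }
}
```

Calling `new Rename().rewrite(root)` returns a new root when some descendant is labeled "x" and leaves the original tree untouched; unchanged subtrees are shared between the old and new trees.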



3.3 Scalable Extensibility 

A language extension may extend the interface of an AST node class through an extension 
object interface. For each new pass, a method is added to the extension object interface 
and a rewriter class is created to invoke the method at each node. For most nodes, a 
single extension object class is implemented to define the default behavior of the pass, 
typically just an identity transformation on the AST node. This class is overridden for 
individual nodes where non-trivial work is performed for the pass. 

To change the behavior of an existing pass at a given node, the programmer creates 
a new delegate class implementing the new behavior and associates the delegate with 
the node at construction time. Like extension classes, the same delegate class may be 
used for several different AST node classes, allowing functionality to be added to node 
classes at arbitrary points in the class hierarchy without code duplication. 

New kinds of nodes are defined by new node classes; existing node types are extended 
by adding an extension object to instances of the class. A factory method for the new 
node type is added to the node factory to construct the node and, if necessary, its delegate 
and extension objects. The new node inherits default implementations of all compiler 
passes from its base class and from the extension’s base class. The new node may provide 
new implementations using method override, possibly via delegation. Methods need be 
overridden only for those passes that need to perform non-trivial work for that node type. 
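One way to picture the role of the node factory is the Java sketch below. All names in it (`AstNode`, `BaseNodeFactory`, `KeyExt`, `CallKeyExt`, `CofferLikeFactory`) are hypothetical and chosen for illustration; Polyglot's actual node factory interface differs.

```java
// Hypothetical names throughout; Polyglot's actual node factory differs.
class AstNode {
    final String kind;
    Object ext;               // extension object, possibly null
    AstNode(String kind) { this.kind = kind; }
}

// Factory for the base language: one factory method per node kind.
class BaseNodeFactory {
    AstNode call(String name) { return new AstNode("Call:" + name); }
    AstNode local(String name) { return new AstNode("Local:" + name); }
}

// Default extension object: the pass does no work at this node.
class KeyExt {
    boolean checksKeys() { return false; }
}

// Overrides the default only where non-trivial work is needed.
class CallKeyExt extends KeyExt {
    @Override boolean checksKeys() { return true; }
}

// The extension's factory attaches extension objects as nodes are built.
class CofferLikeFactory extends BaseNodeFactory {
    @Override AstNode call(String name) { return attach(super.call(name), new CallKeyExt()); }
    @Override AstNode local(String name) { return attach(super.local(name), new KeyExt()); }

    // Factory method for a node kind that the base language lacks.
    AstNode free(String target) { return attach(new AstNode("Free:" + target), new KeyExt()); }

    private static AstNode attach(AstNode n, KeyExt e) {
        n.ext = e;
        return n;
    }
}
```

Since the rest of the compiler creates nodes only through the factory, substituting `CofferLikeFactory` for `BaseNodeFactory` is enough to give every node the right extension object.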

Fig. 4 shows a portion of the code implementing the Coffer key-checking pass, which
checks the set of keys held when control enters a node. The code has been simplified in
the interests of space and clarity. At each node in the AST, the pass invokes through the
del pointer the checkKeys method in the Coffer extension, passing in the set of held
keys (computed by a previous data-flow analysis pass). Since most AST nodes are not
affected by the key-checking pass, a default checkKeys method implemented in the base
CofferExt class is used for these nodes. For other nodes, a non-trivial implementation
of key checking is required.

class KeyChecker extends Rewriter {
    Node leave(Node n) {
        ((CofferExt) n.ext).del.checkKeys(held_keys(n));
        return n;
    }
}

class CofferExt {
    Node node; CofferExt del;
    void checkKeys(Set held_keys) { /* empty */ }
}

class ProcedureCallExt extends CofferExt {
    void checkKeys(Set held_keys) {
        ProcedureCall c = (ProcedureCall) node;
        CofferProcedureType p = (CofferProcedureType) c.callee();
        if (!held_keys.containsAll(p.entryKeys()))
            error(p.entryKeys() + " not held at " + c);
    }
}

Fig. 4. Coffer key checking

Fig. 4 also contains an extension class used to compute the held keys for method and
constructor calls. ProcedureCall is an interface implemented by the classes for three
AST nodes that invoke either methods or constructors: method calls, new expressions,
and explicit constructor calls (e.g., super()). All three nodes implement the checkKeys
method identically. By using an extension object, we need only to write this code once.



4 Other Implementation Details 

In this section we consider some aspects of the Polyglot implementation that are not 
directly related to scalable extensibility. 

Data-Flow Analysis. Polyglot provides an extensible data-flow analysis framework.
In the base Java implementation, this framework is used to check that variables are initialized
before use and that all statements are reachable; extensions may perform additional
data-flow analyses to enable optimizations or to perform other transformations. Polyglot 
provides a rewriter in the base compiler framework that constructs the control-flow graph 
of the program. Intraprocedural data-flow analyses can then be performed on this graph 
by implementing the meet and transfer functions for the analysis. 
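To make the roles of the meet and transfer functions concrete, the following Java sketch runs a forward "definitely assigned" analysis over a tiny control-flow graph with a worklist. It is a generic textbook formulation under simplifying assumptions (nodes are integers, node 0 is the entry, each node assigns a fixed set of variables); the class `DefiniteAssignment` and its methods are hypothetical and are not Polyglot's data-flow API.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical mini framework: a forward data-flow analysis whose items are
// sets of variable names. Not Polyglot's API.
class DefiniteAssignment {
    final List<List<Integer>> succ;   // succ.get(i): successors of node i
    final List<Set<String>> defs;     // defs.get(i): variables assigned at node i

    DefiniteAssignment(List<List<Integer>> succ, List<Set<String>> defs) {
        this.succ = succ;
        this.defs = defs;
    }

    // Meet: a variable is definitely assigned only if it is on every path.
    static Set<String> meet(Set<String> a, Set<String> b) {
        Set<String> r = new HashSet<>(a);
        r.retainAll(b);
        return r;
    }

    // Transfer: executing a node adds the variables it assigns.
    Set<String> transfer(int node, Set<String> in) {
        Set<String> out = new HashSet<>(in);
        out.addAll(defs.get(node));
        return out;
    }

    // Returns, for each node, the variables definitely assigned on entry.
    // A null entry means the node has not been reached yet.
    List<Set<String>> solve() {
        int n = succ.size();
        List<Set<String>> in = new ArrayList<>(Collections.nCopies(n, (Set<String>) null));
        in.set(0, new HashSet<>());
        Deque<Integer> work = new ArrayDeque<>();
        work.add(0);
        while (!work.isEmpty()) {
            int u = work.remove();
            Set<String> out = transfer(u, in.get(u));
            for (int v : succ.get(u)) {
                Set<String> old = in.get(v);
                Set<String> merged = (old == null) ? out : meet(old, out);
                if (!merged.equals(old)) {   // information changed: revisit v
                    in.set(v, merged);
                    work.add(v);
                }
            }
        }
        return in;
    }
}
```

For a diamond-shaped graph in which one branch assigns x and y and the other assigns only x, the analysis reports that only x is definitely assigned at the join point, so a use of y there would be flagged.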

Separate Compilation. Java compilers use type information stored in Java class files 
to support separate compilation. For many extensions, the standard Java type information 
in the class file is insufficient. Polyglot injects type information into class files that can 
be read by later invocations of the compiler to provide separate compilation. No code 
need be written for a language extension to use this functionality for its extended types.
Before performing Java code generation, Polyglot uses the Java serialization facility to
encode the type information for a given class into a string, which is then compressed and 
inserted as a final static field into the AST for the class being serialized. When compiling 
a class, the first time a reference to another class is encountered, Polyglot loads the 
class file for the referenced class and extracts the serialized type information. The type 
information is decoded and may be immediately used by the extension. 
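The encode and decode steps might look roughly like the Java sketch below, which serializes an object, compresses it, and turns the bytes into a string that could be stored in a constant field. The classes `TypeInfo` and `TypeInfoCodec` are hypothetical, and the choice of GZIP and Base64 is an assumption made for illustration; the text does not say which compression or string encoding Polyglot uses.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Base64;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical stand-in for an extension's type information.
class TypeInfo implements Serializable {
    private static final long serialVersionUID = 1L;
    final String className;
    final String[] keys;

    TypeInfo(String className, String... keys) {
        this.className = className;
        this.keys = keys;
    }
}

class TypeInfoCodec {
    // Serialize, compress, and encode as text suitable for a String constant.
    // GZIP and Base64 are assumptions of this sketch.
    static String encode(Serializable info) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(new GZIPOutputStream(bytes))) {
            out.writeObject(info);
        }
        return Base64.getEncoder().encodeToString(bytes.toByteArray());
    }

    // Reverse of encode: decode, decompress, and deserialize.
    static Object decode(String text) throws IOException, ClassNotFoundException {
        byte[] raw = Base64.getDecoder().decode(text);
        try (ObjectInputStream in = new ObjectInputStream(
                new GZIPInputStream(new ByteArrayInputStream(raw)))) {
            return in.readObject();
        }
    }
}
```

A later compiler invocation would read the string back from the class file and pass it to `decode` to recover the extension's type information.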

Quasiquoting. To generate Java output, language extensions translate their ASTs to 
Java ASTs and rely on the code generator of the base compiler to output Java code. To 
enable AST rewriting, we have used PPG to extend Polyglot’s Java parser with the ability 
to generate an AST from a string of Java code and a collection of AST nodes to substitute 
into the generated AST. This feature provides many of the benefits of quasiquoting in
Scheme.

5 Experience 

More than a dozen extensions of varying sizes have been implemented using Polyglot, 
for example: 

- Jif is a Java extension that provides information flow control and features to ensure
the confidentiality and integrity of data [26].

- Jif/split is an extension to Jif that partitions programs across multiple hosts based
on their security requirements.

- PolyJ is a Java extension that supports bounded parametric polymorphism.

- Param is an abstract extension that provides support for parameterized classes. This
extension is not a complete language, but instead includes code implementing lazy
substitution of type parameters. Jif, PolyJ, and Coffer extend Param.

- JMatch is a Java extension that supports pattern matching and logic programming
features.

- Coffer, as previously described, adds resource management facilities to Java.

- PAO (“primitives as objects”) allows primitive values to be used transparently as
objects via automatic boxing and unboxing.

- A covariant return extension restores the subtyping rules of Java 1.0 Beta, in
which the return type of a method could be covariant in subclasses. The language
was changed in the final version of Java 1.0 to require the invariance of return
types.

The major extensions add new syntax and make substantial changes to the language 
semantics. We describe the changes for Jif and PolyJ in more detail below. The simpler 
extensions, such as support for covariant return types, require more localized changes. 



5.1 Jif 

Jif is an extension to Java that permits static checking of information flow policies. In 
Jif, the type of a variable may be annotated with a label specifying a set of principals 
who own the data and a set of principals that are permitted to read the data. Labels are 
checked by the compiler to ensure that the information flow policies are not violated.

The base Polyglot parser is extended using PPG to recognize security annotations
and new statement forms. New AST node classes are added for labels and for new statement
and expression forms concerning security checks. The new AST nodes and nearly
all existing AST nodes are also extended with security context annotations. These new
fields are added to a Jif extension class. To implement information flow checking, a
labelCheck method is declared in the Jif extension object. Many nodes do no work
for this pass and therefore can inherit a default implementation declared in the base Jif
extension class. Extension objects installed for expression and statement nodes override
the labelCheck method to implement the security typing judgment for the node. Delegates
were used to override type checking of some AST nodes to disallow static fields
and inner classes since they may provide an avenue for information leaks.

Following label checking, the Jif AST is translated to a Java AST, largely by erasing
security annotations. The new statement and expression forms are rewritten to Java
syntax using the quasiquoting facility discussed in Section 4.

Jif/split further extends Jif to partition programs across multiple hosts based on their
security requirements. The syntax of Jif is modified slightly to also support integrity
annotations. New passes, implemented in extension objects, partition the Jif/split program
into several Jif programs, each of which will run on a separate host.

5.2 PolyJ 

PolyJ is an extension to Java that supports parametric polymorphism. Classes and interfaces
may be declared with zero or more type parameters constrained by where clauses.
The base Java parser is extended using PPG, and AST node classes are added for where
clauses and for new type syntax. Further, the AST node for class declarations is extended
via inheritance to allow for type parameters and where clauses.

The PolyJ type system customizes the behavior of the base Java type system and introduces
judgments for parameterized and instantiated types. A new pass is introduced to
check that the types on which a parameterized class is instantiated satisfy the constraints
for that parameter, as described in ^T7 \ . 

The base compiler code generator is extended to generate code not only for each 
PolyJ source class, but also an adapter class for each instantiation of a parameterized 
class. 



5.3 Results 

As a measure of the programmer effort required to implement the extensions discussed
in this paper, the sizes of the code for these extensions are shown in Table 1. To eliminate
bias due to the length of identifiers in the source, sizes are given in number of tokens for
source files, including Java, CUP, and PPG files.

These results demonstrate that the cost of implementing language extensions scales
well with the degree to which the extension differs from its base language. Simple extensions
such as the covariant return extension that differ from Java in small, localized ways
can be implemented by writing only small amounts of code. To measure the overhead of
simply creating a language extension, we implemented an empty extension that makes
no changes to the Java language; the overhead includes empty subclasses of the base
compiler node factory and type system classes, an empty PPG parser specification, and
code for allocating these subclasses.

Table 1. Extension size

Extension          Token count   Percent of base Polyglot
base Polyglot           164136   100%
Jif                     126188   77%
JMatch                  105269   64%
PolyJ                    78159   48%
Coffer                   21251   13%
PAO                       3422   2%
Param                     3233   2%
covariant return          1562   1%
empty                      691   < 1%
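The percentage column of Table 1 follows directly from the token counts. The short Java sketch below recomputes it, assuming the percentages were obtained by rounding to the nearest whole number; the class and method names are hypothetical.

```java
// Recomputes the "Percent of base Polyglot" column of Table 1 from token counts.
class ExtensionSizes {
    static final int BASE_TOKENS = 164136;

    // Size of an extension as a whole-number percentage of base Polyglot.
    static long percentOfBase(int tokens) {
        return Math.round(100.0 * tokens / BASE_TOKENS);
    }

    public static void main(String[] args) {
        String[] names = {"Jif", "JMatch", "PolyJ", "Coffer", "PAO", "Param",
                          "covariant return", "empty"};
        int[] tokens = {126188, 105269, 78159, 21251, 3422, 3233, 1562, 691};
        for (int i = 0; i < names.length; i++) {
            System.out.println(names[i] + ": " + percentOfBase(tokens[i]) + "%");
        }
    }
}
```

For example, Jif's 126188 tokens are about 76.9% of the base compiler's 164136, which rounds to the 77% shown in the table; the empty extension's 691 tokens come to about 0.4%, reported as "< 1%".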

PolyJ, which has large changes to the type system and to code generation, requires 
only about half as much code as the base Java compiler. For historical reasons, PolyJ 
generates code by overriding the Polyglot code generator to directly output Java. The 
size of this code could be reduced by using quasiquoting. Jif requires a large amount of 
extension code because label checking in Jif is more complex than the Java type checking 
that it extends. Much of the JMatch overhead is accounted for by extensive changes to 
add complex statement and expression translations. 

As a point of comparison, the base Polyglot compiler (which implements Java 1.4)
and the Java 1.1 compiler, javac, are nearly the same size when measured in tokens.
Thus, the base Polyglot compiler implementation is reasonably efficient. To be fair to
javac, we did not count its code for bytecode generation. About 10% of the base Polyglot
compiler consists of interfaces used to separate the interface hierarchy from the class
hierarchy. The javac compiler is not implemented this way.

Implementing small extensions has proved to be fairly easy. We asked a program- 
mer previously unfamiliar with the framework to implement the covariant return type 
extension; this took one day. The same programmer implemented several other small 
extensions within a few days. 



5.4 Discussion 

In implementing Polyglot we found, not surprisingly, that application of good object-oriented
design principles greatly enhances Polyglot’s extensibility. Rigorous separation
of interfaces and classes permits implementations to be more easily extended and
replaced; calls through interfaces ensure the framework is not bound to any particular 
implementation of an interface. The Polyglot framework almost exclusively uses factory 
methods to create objects Ha, giving language extensions more freedom to change the 
implementation provided by the base compiler by avoiding explicitly tying code to a 
particular class. 
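
The effect of creating nodes only through factory methods can be illustrated with a minimal sketch. It is written in Python for brevity, and every name in it (NodeFactory, CovariantNodeFactory, MethodDecl, parse_method, and the toy subtype relation) is hypothetical; it is not Polyglot's actual API. The point is only that framework code which never names a concrete node class can be redirected to extension classes by swapping the factory.

```python
from dataclasses import dataclass


@dataclass
class MethodDecl:
    """Base AST node for a method declaration (hypothetical)."""
    name: str
    return_type: str

    def check_override(self, overridden: "MethodDecl") -> bool:
        # Base rule: an overriding method must keep the exact return type.
        return self.return_type == overridden.return_type


class CovariantMethodDecl(MethodDecl):
    """Extension node: the return type may be a subtype of the overridden one."""
    # Toy subtype relation, assumed for the example only.
    subtypes = {("String", "Object"), ("Integer", "Number")}

    def check_override(self, overridden: MethodDecl) -> bool:
        return (self.return_type == overridden.return_type
                or (self.return_type, overridden.return_type) in self.subtypes)


class NodeFactory:
    """Base factory: the only place that names the concrete node class."""
    def method_decl(self, name: str, return_type: str) -> MethodDecl:
        return MethodDecl(name, return_type)


class CovariantNodeFactory(NodeFactory):
    """Extension factory: overrides one factory method, nothing else."""
    def method_decl(self, name: str, return_type: str) -> MethodDecl:
        return CovariantMethodDecl(name, return_type)


def parse_method(nf: NodeFactory, name: str, return_type: str) -> MethodDecl:
    # Framework code creates nodes through the factory, never via a class name.
    return nf.method_decl(name, return_type)
```

With the base factory, a method returning String fails the override check against one returning Object; with the extension factory the same framework call yields a node that accepts it.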
We chose to implement Polyglot using only standard Java features, but it is clear that
several language extensions — some of which we have implemented using Polyglot — 
would have made it easier to implement Polyglot. Multimethods (e.g., 0]) would have 
simplified the dispatching mechanism needed for our methodology. Open classes |S| 
might provide a cleaner solution to the extensibility problem, particularly in conjunc- 
tion with multimethods. Aspect-oriented programming |[20ll is another technique for 
adding and overriding methods in an existing class hierarchy. Hierarchically extensible 
datatypes and functions fl5\ offer another solution to the extensibility problem. Multi- 
ple inheritance and, in particular, mixins (e.g., 1411 IP ) would facilitate application of an 
extension to many AST nodes at once. Built-in quasiquoting support would make transla- 
tion more efficient, though the need to support several target languages would introduce 
some difficulties. Covariant modification of method return types would eliminate many 
unnecessary type casts, as would parametric polymorphism B27I28II . 

6 Related Work 

There is much work that is related to Polyglot, including other extensible compilers, 
macro systems, and visitor patterns. 

JaCo is an extensible compiler for Java written in an extended version of Java Q21 
that supports ML- style pattern matching. JaCo does not provide mixin extensibility. It 
relies on a new language feature — extensible algebraic datatypes [HS1 — to address the 
difficulty of handling new data types without changing existing code. Polyglot achieves 
scalable extensibility while relying only on features available in Java. 

CoSy im is a framework for combining compiler phases to create an optimizing 
compiler. Compiler phases can be added and reused in multiple contexts without chang- 
ing existing code. The framework was not designed for syntax extension. In the SUIF 
compiler O, data structures can be extended with annotations, similar to Polyglot’s extension
objects; new annotations are ignored by existing compiler passes. Scorpion EH
is a meta-programming environment that has a similar extension mechanism. Neither
SUIF nor Scorpion has a mechanism like Polyglot’s delegate objects to mix in method
overrides.

JastAdd Ci is a compiler framework that uses aspect-oriented programming to add 
methods and fields into the AST node class hierarchy to implement new passes or to 
override existing passes. The AST node hierarchy may be extended via inheritance, but 
duplicate code may need to be written for each pass to support new nodes. 

Macro systems and preprocessors are generally concerned only with syntactic ex- 
tensions to a language. Recent systems for use in Java include EPP ifTFIl . JSE iV2] . and 
jpp EU. Maya ii is a generalization of macro systems that uses generic functions 
and multimethods to allow extension of Java syntax. Semantic actions can be defined 
as multimethods on those generic functions. It is not clear how these systems scale to 
support semantic checking for large extensions to the base language. 

The Jakarta Tools Suite (JTS) E is a toolkit for implementing Java preprocessors
to create domain-specific languages. Extensions of a base language are encapsulated
as components that define the syntax and semantics of the extension. A fundamental
difference between JTS and Polyglot is that JTS is concerned primarily with the syntactic
analysis of the extension language, not with semantic analysis O section 4]. This makes
JTS more like a macro system in which the macros are defined by extending the compiler
rather than declaring them in the source code.

OpenJava [34] uses a meta-object protocol (MOP) similar to Java’s reflection API
to allow manipulation of a program’s structure. OpenJava allows very limited extension 
of syntax, but through its MOP exposes much of the semantic structure of the program. 

The original Visitor design pattern da has led to many refinements. Extensible Visi- 
tors m and Staggered Visitors f35l| both enhance the extensibility of the visitor pattern 
to facilitate adding new node types, but neither these nor the other refinements men- 
tioned above support mixin extensibility. Staggered Visitors rely on multiple inheritance 
to extend visitors with support for new nodes. 



7 Conclusions 

Our original motivation for developing the Polyglot compiler framework was simply to 
provide a publicly available Java front end that could be easily extended to support new 
languages. We discovered that the existing approaches to extensible compiler construc- 
tion within Java did not solve to our satisfaction the problem of scalable extensibility 
including mixins. Our extended visitor methodology is simple, yet improves on the pre- 
vious solutions to the extensibility problem. Other Polyglot features such as extensible 
parsing, pass scheduling, quasiquoting, and type signature insertion are also useful. Our 
experience using Polyglot has shown that it is an effective way to produce compilers 
for Java-like languages. We have used the framework for several significant language 
extensions that modify Java syntax and semantics in complex ways. We hope that the 
public release of this software in source code form will facilitate experimentation with 
new features for object-oriented languages. 
Abstract. Most points-to analysis research has been done on different systems
by different groups, making it difficult to compare results, and to understand 
interactions between individual factors each group studied. Furthermore, points- 
to analysis for Java has been studied much less thoroughly than for C, and the 
tradeoffs appear very different. We introduce Spark, a flexible framework for 
experimenting with points-to analyses for Java. Spark supports equality- and 
subset-based analyses, variations in field sensitivity, respect for declared types, 
variations in call graph construction, off-line simplification, and several solving 
algorithms. Spark is composed of building blocks on which new analyses can be 
based. 

We demonstrate Spark in a substantial study of factors affecting precision and 
efficiency of subset-based points-to analyses, including interactions between these 
factors. Our results show that Spark is not only flexible and modular, but also 
offers superior time/space performance when compared to other points-to analysis 
implementations. 



1 Introduction 

Many compiler analyses and optimizations, as well as program understanding and ver- 
ification tools, require information about which objects each pointer in a program may 
point to at run-time. The problem of approximating these points-to sets has been the 
subject of much research; however, many questions remain unanswered flTFI . As with 
many compiler analyses, a precision vs. time trade-off exists for points-to analysis. For 
analyzing programs written in C, many points between the extremes of high-precision, 
slow and low-precision, fast have been explored. These analyses
have been implemented as parts of distinct systems, so it is difficult to compare and
combine their unique features. The design tradeoffs for doing points-to analysis for Java 
appear to be different than for C, and recently, several different approaches to points-to 
analysis for Java have been suggested IITlEniEZI- However, once again, it is hard to 
compare the results since each group has implemented their analysis in a different sys- 
tem, and has made very different assumptions about how to handle the large Java class 
libraries and Java native methods. 

To address these issues, we have developed the Soot Pointer Analysis Research Kit 
(Spark), a flexible framework for experimenting with points-to analyses for Java. Spark 
is very modular: the pointer assignment graph that it produces and simplifies can be used 
as input to other solvers, including those being developed by other researchers. We hope 
that this will make it easier for researchers to compare results. In addition, the correctness 
of new analyses can be verified by comparing their results to those computed by the basic
analyses provided in Spark. 

In order to demonstrate the usefulness of the framework, we have also performed a 
substantial empirical study of a variety of subset-based points-to analyses using Spark. 
We studied a wide variety of factors that affect both precision and time/space costs. 
Our results show that Spark is not only flexible and modular, but also offers very good 
time/space performance when compared to other points-to analysis implementations. 

Specific new contributions of this paper are as follows. (1) The Spark framework 
itself is available as part of Soot 1.2.4 iia and later releases under the LGPL for the 
use of all researchers. (2) We present a study of a variety of representations for points- 
to sets and of a variety of solving strategies, including an incremental, worklist-based, 
field-sensitive algorithm which appears to scale well to larger benchmarks. (3) We report 
on an empirical evaluation of many factors affecting the precision, speed, and memory 
requirements of subset-based points-to analysis algorithms. We focus on improving the 
speed of the analysis without significant loss of precision. (4) We make recommendations 
to allow analyses to scale to programs on the order of a million lines of code. Even trivial 
Java programs are becoming this large as the standard class library grows. 

The structure of this paper is as follows. In Section 2 we examine some of the
challenges and factors to consider when designing an effective points-to analysis for Java.
In Section 3 we introduce the Spark framework and discuss the important components.
Section 4 shows Spark in action via a large empirical study of a variety of subset-based
pointer analyses. In Section 5 we discuss related work and in Section 6 we provide our
conclusions and discuss future work.



2 Points-to Analysis for Java 

Although some of the techniques developed for C have been adapted to Java, there are 
significant differences between the two languages that affect points-to analysis. In C, 
points-to analysis can be viewed as two separate problems: analysis of stack-directed 
pointers, and analysis of heap-directed pointers. Most C programs have many more 
occurrences of the address-of (&) operator, which creates stack-directed pointers, than 
dynamic allocation sites, which create heap-directed pointers. It is therefore important 
for C points-to analyses to deal well with stack-directed pointers. Java, on the other 
hand, allows no stack-directed pointers whatsoever, and Java programs usually have 
many more dynamic allocation sites than C programs of similar size. Java analyses 
therefore have to handle heap-directed pointers well. Another important difference is 
the strong type checking in Java, which limits the sets of objects that a pointer could 
point to, and can therefore be used to improve analysis speed and precision. Diwan et
al. have shown the benefits of type-based alias analysis for Modula-3 ina. Our study
shows that using types in Java is very useful for improving efficiency, and also results 
in a small improvement in precision. 

The object-oriented nature of Java also introduces new complexities in dealing with 
any whole program analysis. In order to build a call graph, some approximation of 
the targets of virtual method calls must be used. There are two basic approaches. The 
first approach is to use an approximation of the call graph built by another analysis. 
The second approach is to construct the call graph on-the-fly, as the pointer analysis
proceeds. In our empirical study, Section 4, we compare the two approaches.

Related to the problem of finding a call graph is finding the set of methods that 
must be analyzed. In sequential C programs, there is one entry point, main, and a whole 
program analysis can start at this entry point and then incrementally (either ahead-of-time 
or during analysis) add all called methods. In Java the situation is much more complicated 
as there are many potential entry points including static initializers, finalizers, thread start 
methods, and methods called using reflection. Further complicating matters are native 
methods which may impact points-to analysis, but for which we do not have the code to 
analyze. Our Spark framework addresses these points. 

Another very important point is the large size of the Java libraries. Even small ap- 
plication programs may touch, or appear to touch, a large part of the Java library. This 
means that a whole program analysis must be able to handle large problem sizes. Existing
points-to analyses for Java have been successfully tested with the 1.1.8 version of
the Java standard libraries HTTIEUI, consisting of 148 thousand lines of code (KLOC).
However, current versions of the standard library are over three times larger (e.g., 1.3.1_01
is 574 KLOC), dwarfing most application programs that use them, so it is not clear that
existing analyses would scale to such large programs. Our framework has been designed 
to provide the tools to develop efficient and scalable analyses which can effectively 
handle large benchmarks using the large libraries. 

3 SPARK Framework 

3.1 Overview 

The Soot Pointer Analysis Research Kit (Spark) is a flexible framework for experiment- 
ing with points-to analyses for Java. Although Spark is very competitive in efficiency 
with other points-to analysis systems, the main design goal was not raw speed, but rather 
the flexibility to make implementing a wide variety of analyses as easy as possible, to 
facilitate comparison of existing analyses and development of new ones. 

Spark supports both subset-based m and equality-based [22] analyses, as well as
variations that lie between these two extremes. In this paper, we focus on the more 
precise, subset-based analyses. Although Spark is limited to flow-insensitive analyses, 
most of the benefit of flow-sensitivity is obtained by splitting variables. 

Spark is implemented as part of the Soot bytecode analysis, optimization, and anno- 
tation framework lES]. Soot accepts Java bytecode as input, converts it to one of several 
intermediate representations, applies analyses and transformations, and converts the re- 
sults back to bytecode. Spark uses as its input the Jimple intermediate representation, a 
three-address representation in which local (stack) variables have been split according to 
DU-UD webs, and declared types have been inferred for them. The results of Spark can 
be used by other analyses and transformations in Soot. Soot also provides an annotation 
framework that can be used to encode the results in classfile annotations for use by other 
tools or runtime systems (H. 

The execution of Spark can be divided into three stages: pointer assignment graph 
construction, pointer assignment graph simplification, and points-to set propagation. 
These stages are described in the following subsections. 
3.2 Pointer Assignment Graph

Spark uses a pointer assignment graph as its internal representation of the program being 
analyzed. The first stage of Spark, the pointer assignment graph builder, constructs the 
pointer assignment graph from the Jimple input. Separating the builder from the solver 
makes it possible to use the same solution algorithms and implementations to solve 
different variations of the points-to analysis problem. 

The pointer assignment graph consists of three types of nodes. Allocation site nodes 
represent allocation sites in the source program, and are used to model heap locations. 
Simple variable nodes represent local variables, method parameters and return values, 
and static fields. Field dereference nodes represent field access expressions in the source 
program; each is parametrized by a variable node representing the variable being deref- 
erenced by the field access. The nodes in the pointer assignment graph are connected 
with four types of edges reflecting the pointer flow, corresponding to the four types of 
constraints imposed by the pointer-related instructions in the source program (Table 1).
In this table, a and b denote allocation site nodes, src and dst denote variable nodes, 
and src.f and dst.f denote field dereference nodes. 



Table 1. The four types of pointer assignment graph edges.

              Allocation        Assignment     Field store     Field load
Instruction   a: dst := new C   dst := src     dst.f := src    dst := src.f
Edge          a → dst           src → dst      src → dst.f     src.f → dst
Rules         a → dst           src → dst      src → dst.f     src.f → dst
                                a ∈ pt(src)    a ∈ pt(src)     a ∈ pt(src)
                                               b ∈ pt(dst)     b ∈ pt(a.f)
              ---------------   ------------   -------------   ------------
              a ∈ pt(dst)       a ∈ pt(dst)    a ∈ pt(b.f)     b ∈ pt(dst)


Later, during the propagation of points-to sets, a fourth type of node (denoted a.f and 
b.f) is created to hold the points-to set of each field of objects created at each allocation 
site. These nodes are parameterized by allocation site and field. However, they are not 
part of the initial pointer assignment graph. 
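
The structure described above can be sketched as a small graph builder. This is an illustration in Python, not Spark's implementation: the class PointerAssignmentGraph, the function build_pag, and the tuple encoding of instructions are all assumptions made for the example. Variable nodes are plain strings, and a field dereference node is a (variable, field) pair.

```python
from collections import defaultdict


class PointerAssignmentGraph:
    """Toy pointer assignment graph holding the four edge kinds of Table 1."""

    def __init__(self):
        self.alloc_edges = defaultdict(set)   # allocation site a -> {dst}
        self.assign_edges = defaultdict(set)  # src -> {dst}
        self.store_edges = defaultdict(set)   # src -> {(dst, f)}  for dst.f := src
        self.load_edges = defaultdict(set)    # (src, f) -> {dst}  for dst := src.f


def build_pag(instructions):
    """Build a graph from tuple-encoded instructions (hypothetical encoding)."""
    pag = PointerAssignmentGraph()
    for ins in instructions:
        kind, args = ins[0], ins[1:]
        if kind == "new":        # ("new", site, dst):      site: dst := new C
            site, dst = args
            pag.alloc_edges[site].add(dst)
        elif kind == "assign":   # ("assign", dst, src):    dst := src
            dst, src = args
            pag.assign_edges[src].add(dst)
        elif kind == "store":    # ("store", dst, f, src):  dst.f := src
            dst, f, src = args
            pag.store_edges[src].add((dst, f))
        elif kind == "load":     # ("load", dst, src, f):   dst := src.f
            dst, src, f = args
            pag.load_edges[(src, f)].add(dst)
        else:
            raise ValueError(f"unknown instruction kind: {kind}")
    return pag
```

Note that no a.f nodes appear here; as stated above, those are created only later, during propagation.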

Depending on the parameters to the builder, the pointer assignment graph for the 
same source code can be very different, reflecting varying levels of precision desired of 
the points-to analysis. As an example, the builder may make assignments directed for a 
subset-based analysis, or bi-directional for an equality-based analysis. Another example
is the representation of field dereference expressions in the graph, as discussed next. 



Field Dereference Expressions: A field expression p.f refers to the field f of the
object pointed to by p. There are three standard ways of dealing with fields. A field-sensitive
interpretation, which is the most precise, considers p.f to represent only the
field f of only objects in the points-to set of p. A less precise, field-based interpretation
approximates each field f of all objects using a single set, ignoring the p. The key
advantage of this is that points-to sets can be propagated along a pointer assignment
graph of only simple variable nodes in one single iteration, by first merging strongly-connected
components of nodes, then propagating in topological order. Many C points-to
analyses use a field-independent interpretation, which ignores the f, and approximates
all the fields of objects in the points-to set of p as a single location. In Java, the field
information is readily available, and different fields are guaranteed not to be aliased, so
a field-independent interpretation makes little sense. Spark supports field-sensitive and
field-based analyses, and field-independent analyses would be trivial to implement.
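
The three interpretations differ only in which abstract location stands for the field f of an abstract object. The following Python sketch makes that choice explicit; the function names, the mode strings, and the "*" wildcard are assumptions of the example, not part of Spark.

```python
def field_location(obj, field, mode):
    """Abstract location for field `field` of abstract object `obj`."""
    if mode == "field-sensitive":
        return (obj, field)   # one location per (allocation site, field)
    if mode == "field-based":
        return ("*", field)   # one location per field; the base object is ignored
    if mode == "field-independent":
        return (obj, "*")     # one location per object; the field is ignored
    raise ValueError(f"unknown mode: {mode}")


def distinct_locations(accesses, mode):
    """Set of abstract locations touched by a list of (object, field) accesses."""
    return {field_location(obj, field, mode) for obj, field in accesses}
```

For the accesses o1.f, o2.f and o1.g, the field-sensitive interpretation keeps three separate locations, while the field-based one merges o1.f with o2.f and the field-independent one merges o1.f with o1.g.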

3.3 Call Graph Construction 

An interprocedural points-to analysis requires an approximation of the call graph. This 
can be constructed in advance using a technique such as CHA El, RTA [C] or VTA [1^ . 
or it can be constructed on-the-fly as the points-to sets of call site receivers are computed.
The latter approach gives somewhat higher precision, but requires more iteration as edges 
are added to the pointer assignment graph. 

Spark supports all of these variations, but in this paper, our empirical study focuses 
on CHA and on-the-fly call graph construction. Spark always uses the CHA call graph 
builder included in Soot to determine which methods are reachable for the purposes of 
building the pointer assignment graph. However, on-the-fly call graph construction can
be achieved at solving time by excluding interprocedural edges from the initial graph, 
and then adding only the reachable edges as the points-to sets are propagated. 

In theory, determining which methods are possibly reachable at run-time is simple: 
start with a root set containing the main method, and transitively add all methods which 
are called from methods in the set. Java is not this simple, however; execution can 
also start at static initializers, finalizers, thread start methods, and dynamic call sites 
using reflection. Soot considers all these factors in determining which methods are 
reachable. For the many call sites using reflection inside the standard class library, we 
have compiled, by hand, a list of their possible targets, and they are automatically added 
to the root set. 
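
The reachability computation described above is a transitive closure over the call graph, started from a root set. A minimal Python sketch follows; the function name and the dictionary encoding of the call graph are assumptions, and the method names in the usage note are invented for illustration.

```python
from collections import deque


def reachable_methods(call_graph, roots):
    """Return all methods transitively callable from the root set.

    call_graph maps a method name to the methods it may call; roots is the
    set of entry points (main, static initializers, thread start methods, ...).
    """
    reachable = set(roots)
    queue = deque(roots)
    while queue:
        method = queue.popleft()
        for callee in call_graph.get(method, ()):
            if callee not in reachable:
                reachable.add(callee)
                queue.append(callee)
    return reachable
```

Starting from a main method alone misses anything reached only through, say, a static initializer; adding that initializer to the root set brings its callees in as well.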

In addition, native methods may affect the flow of pointers in a Java program. Spark 
therefore includes a native method simulation framework. The effects of each native 
method are described in the framework using abstract Java code, and Spark then creates 
the corresponding pointer flow edges. The native method simulation framework was 
designed to be independent of Spark, so the simulations of native methods should be 
usable by other analyses. 

3.4 Points-to Assignment Graph Simplification 

Before points-to sets are propagated, the pointer assignment graph can be simplified by 
merging nodes that are known to have the same points-to set. Specifically, all the nodes 
in a strongly-connected component (cycle) will have equal points-to sets, so they can 
be merged to a single node. A version of the off-line variable substitution algorithm 
given in IfTyi is also used to merge equivalence sets of nodes that have a single common
predecessor.^

^ If types are being used, then only nodes with compatible types can be merged; the interaction
of types and graph simplification is examined in Section U






O. Lhotak and L. Hendren 



Spark uses a fast union-find algorithm to merge nodes in time almost linear
in the number of nodes. This is the same algorithm used for equality-based HTH analyses.
Therefore, by making all edges bidirectional and merging nodes forming strongly-connected
components, we can implement an equality-based analysis in Spark. In fact,
we can easily implement a hybrid analysis which is partly equality-based and partly 
subset-based by making only some of the edges bidirectional. One instance of a sim- 
ilarly hybrid analysis is described in (1. Even when performing a fully subset-based 
analysis, we can use the same unification code to simplify the pointer assignment graph. 
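
A minimal Python sketch of this idea follows. It uses a textbook union-find with path compression and union by rank, which is an assumption about the flavour of algorithm meant; the class UnionFind and the helper unify_bidirectional are hypothetical names. Edges marked bidirectional are unified (the equality-based treatment), while the remaining edges are kept as subset edges between the merged representatives, giving the hybrid analysis mentioned above.

```python
class UnionFind:
    """Union-find with path compression and union by rank."""

    def __init__(self):
        self.parent = {}
        self.rank = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        self.rank.setdefault(x, 0)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:      # path compression
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return ra
        if self.rank[ra] < self.rank[rb]:  # attach the shallower tree
            ra, rb = rb, ra
        self.parent[rb] = ra
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1
        return ra


def unify_bidirectional(edges, bidirectional):
    """Merge endpoints of bidirectional edges; return (uf, remaining subset edges)."""
    uf = UnionFind()
    for src, dst in edges:
        if (src, dst) in bidirectional:
            uf.union(src, dst)
    remaining = set()
    for src, dst in edges:
        rs, rd = uf.find(src), uf.find(dst)
        if rs != rd:                       # edges inside a merged node disappear
            remaining.add((rs, rd))
    return uf, remaining
```

Marking every edge as bidirectional yields a fully equality-based analysis; marking none leaves the subset-based graph unchanged.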

3.5 Set Implementations 

Choosing an appropriate set representation for the points-to sets is a key part of designing 
an effective analysis. The following implementations are currently included as part of 
Spark; others should be easy to add. Hash Set is a wrapper for the HashSet implementation
from the standard class library. It is provided as a baseline against which the other
set implementations can be compared. Sorted Array Set implements a set using an array 
which is always kept in sorted order. This makes it possible to compute the union of two 
sets in linear time, like in a merge sort. Bit Set implements a set as a bit vector. This 
makes set operations very fast regardless of how large the sets get (as long as the size of 
the universal set stays constant). The drawback is that the many sparse sets use a large 
amount of memory. Hybrid Set represents small sets (up to 16 elements) explicitly using 
pointers to the elements themselves, but switches to a bit vector representation when the 
sets grow larger, thus allowing both small and large sets to be represented efficiently. 
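
The Hybrid Set idea can be sketched as follows. This is an illustrative Python version, not Spark's code: it assumes that set elements are non-negative integer ids (for example, numbered allocation sites), uses a Python integer as the bit vector, and takes the threshold of 16 from the description above. The class and method names are hypothetical.

```python
class HybridSet:
    """Set of non-negative integer ids: explicit list while small, bit vector when large."""

    THRESHOLD = 16  # largest size stored explicitly, as stated in the text

    def __init__(self):
        self.small = []    # explicit elements while the set is small
        self.bits = None   # Python int used as a bit vector once the set grows

    def add(self, n):
        """Add id n; return True if the set changed."""
        if self.bits is None:
            if n in self.small:
                return False
            if len(self.small) < self.THRESHOLD:
                self.small.append(n)
                return True
            # Too many elements: switch to the bit vector representation.
            self.bits = 0
            for e in self.small:
                self.bits |= 1 << e
            self.small = []
        mask = 1 << n
        if self.bits & mask:
            return False
        self.bits |= mask
        return True

    def add_all(self, other):
        """Union other into self; return True if the set changed."""
        if self.bits is not None and other.bits is not None:
            merged = self.bits | other.bits   # one bitwise OR for two large sets
            changed = merged != self.bits
            self.bits = merged
            return changed
        changed = False
        for e in list(other):
            changed = self.add(e) or changed
        return changed

    def __iter__(self):
        if self.bits is None:
            return iter(list(self.small))
        return (i for i in range(self.bits.bit_length()) if (self.bits >> i) & 1)

    def __len__(self):
        return len(self.small) if self.bits is None else bin(self.bits).count("1")
```

Both add and add_all report whether the set changed, which is the information a propagation algorithm needs to decide whether further work is required.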



3.6 Points-to Set Propagation 

After the pointer assignment graph has been built and simplified, the final step is propagating
the points-to sets along its edges according to the rules shown in Table 1. Spark
provides several different algorithms to implement these rules.



Iterative Algorithm: Spark includes a naive, baseline, iterative algorithm (Algorithm 1)
that can be used to check the correctness of the results of the more complicated
algorithms.^ Note that for efficiency, all the propagation algorithms in Spark consider
variable nodes in topological order (or pseudo-topological order, if cycles have not been
simplified).

Worklist Algorithm: For some of our benchmarks, the iterative algorithm performs
over 60 iterations. After the first few iterations, the points-to sets grow very little, yet
each iteration is as expensive as the first few. A better, but more complex solver based on
worklists is also provided as part of Spark and is outlined in Algorithm 2. This solver
maintains a worklist of variable nodes whose points-to sets need to be propagated to
their successors, so that only those nodes are considered for propagation.

^ For clarity, the algorithms are presented here without support for on-the-fly call graph construction.
However, both variations are implemented in Spark and evaluated in Section 4.
Algorithm 1 Iterative Propagation

initialize sets according to allocation edges
repeat
    propagate sets along each assignment edge p → q
    for each load edge p.f → q do
        for each a ∈ pt(p) do
            propagate sets pt(a.f) → pt(q)
    for each store edge p → q.f do
        for each a ∈ pt(q) do
            propagate sets pt(p) → pt(a.f)
until no changes
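
Algorithm 1 can be turned into a short runnable sketch. The Python version below is an illustration under stated assumptions, not Spark's solver: edges are passed as plain tuples, points-to sets are Python sets, a field node a.f is the pair (a, f), and nodes are visited in list order rather than topological order. The function names are hypothetical.

```python
from collections import defaultdict


def propagate(pt, src, dst):
    """Add pt(src) to pt(dst); return True if pt(dst) grew."""
    before = len(pt[dst])
    pt[dst] |= pt[src]
    return len(pt[dst]) != before


def iterative_propagation(allocs, assigns, loads, stores):
    """Naive fixed-point solver.

    allocs:  (a, p)     for allocation edge a -> p
    assigns: (p, q)     for assignment edge p -> q
    loads:   (p, f, q)  for load edge       p.f -> q
    stores:  (p, q, f)  for store edge      p -> q.f
    """
    pt = defaultdict(set)
    for a, p in allocs:                    # initialize from allocation edges
        pt[p].add(a)
    changed = True
    while changed:                         # repeat ... until no changes
        changed = False
        for p, q in assigns:
            changed |= propagate(pt, p, q)
        for p, f, q in loads:
            for a in list(pt[p]):          # copy: pt(p) may grow if q == p
                changed |= propagate(pt, (a, f), q)
        for p, q, f in stores:
            for a in list(pt[q]):
                changed |= propagate(pt, p, (a, f))
    return pt
```

For the program p = new A (site o1); x = new X (site o2); q = p; q.f = x; r = p.f, the store through q and the load through p meet at the field node (o1, f), so r ends up pointing to o2.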



Algorithm 2 Worklist Propagation

 1: for each allocation edge o_i → p do
 2:     pt(p) = {o_i}
 3:     add p to worklist
 4: repeat
 5:     repeat
 6:         remove first node p from worklist
 7:         propagate sets along each assignment edge p → q,
            adding q to worklist whenever pt(q) changes
 8:         for each store edge q → r.f where p = q or p = r do
 9:             for each a ∈ pt(r) do
10:                 propagate sets pt(q) → pt(a.f)
11:         for each load edge p.f → q do
12:             for each a ∈ pt(p) do
13:                 propagate sets pt(a.f) → q
14:                 add q to worklist if pt(q) changed
15:     until worklist is empty
16:     for each store edge q → r.f do
17:         for each a ∈ pt(r) do
18:             propagate sets pt(q) → pt(a.f)
19:     for each load edge p.f → q do
20:         for each a ∈ pt(p) do
21:             propagate sets pt(a.f) → q
22:             add q to worklist if pt(q) changed
23: until worklist is empty



In the presence of field-sensitivity, however, the worklist algorithm is not so simple.
Whenever a variable node p appears in the worklist (which means that its points-to set
has new nodes in it that need to be propagated), the algorithm propagates along edges of
the form p → q, but also along loads and stores involving p (those of the form p → q.f,
q → p.f, and p.f → q), since they are likely to require propagation. However, this is not
sufficient to obtain the complete solution. For example, suppose that a is in the points-to
sets of both p and q, so that p and q are possible aliases. After processing any store into
q.f, we should process all loads from p.f. However, there is no guarantee that p will
appear in the worklist. For this reason, the algorithm must still include an outer iteration
over all the load and store edges. To summarize, lines 16 to 22 in the outer loop are
necessary for correctness; lines 8 to 14 could be removed, but including them greatly
reduces the number of iterations of the outer loop and therefore reduces the analysis
time.
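
The worklist algorithm and the role of its outer loop can be seen in a small runnable sketch. The Python code below follows the line structure of Algorithm 2 but is only an illustration: the tuple encoding of edges and all function names are assumptions, and the edge lists are scanned linearly where a real solver would index edges by node.

```python
from collections import defaultdict, deque


def worklist_propagation(allocs, assigns, loads, stores):
    """allocs: (o, p) for o -> p;  assigns: (p, q) for p -> q;
    loads: (p, f, q) for p.f -> q;  stores: (q, r, f) for q -> r.f."""
    pt = defaultdict(set)
    worklist, queued = deque(), set()

    def enqueue(v):
        if v not in queued:
            queued.add(v)
            worklist.append(v)

    def propagate(src, dst):
        before = len(pt[dst])
        pt[dst] |= pt[src]
        return len(pt[dst]) != before

    def process_store(q, r, f):
        for a in list(pt[r]):
            propagate(q, (a, f))

    def process_load(p, f, q):
        for a in list(pt[p]):
            if propagate((a, f), q):
                enqueue(q)

    for o, p in allocs:                      # lines 1-3
        pt[p].add(o)
        enqueue(p)

    while worklist:                          # outer repeat, lines 4-23
        while worklist:                      # inner repeat, lines 5-15
            p = worklist.popleft()
            queued.discard(p)
            for src, q in assigns:           # line 7
                if src == p and propagate(p, q):
                    enqueue(q)
            for q, r, f in stores:           # lines 8-10
                if p == q or p == r:
                    process_store(q, r, f)
            for base, f, q in loads:         # lines 11-14
                if base == p:
                    process_load(base, f, q)
        for q, r, f in stores:               # lines 16-18
            process_store(q, r, f)
        for p, f, q in loads:                # lines 19-22
            process_load(p, f, q)
    return pt
```

In the usage below, p and q are aliases, the store goes through q and the load through p. When p is taken from the worklist the field node is still empty, and p never re-enters the worklist after the store is processed, so it is the outer pass over all load edges that finally delivers o2 to r.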



Incremental Sets: In certain implementations of sets (hash set and sorted array set), 
each set union operation takes time proportional to the size of the sets being combined. 
While iterating through an analysis, the contents of one set are repeatedly merged into 
the contents of another set, often adding only a small number of new elements in each 
iteration. We can improve the algorithm by noting that the elements that have already 
been propagated must be found in the set in every subsequent iteration. 

Thus, as an optional improvement, Spark includes versions of the solvers that use
incremental sets. Each set is divided into a “new” part and an “old” part. During each 
iteration, elements are propagated only between the new parts, which are likely to be 
small. At the end of each iteration, all the new parts are flushed into their corresponding 
old part. An additional advantage of this is that when constructing the call graph on-the- 
fly, only the smaller, new part of the points-to set of the receiver of each call site needs 
to be considered in each iteration. 
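
The division into old and new parts can be sketched as follows. This Python fragment is an illustration only; the class IncrementalSet and the helper propagate_new are hypothetical names, and the sketch assumes that everything in the old part of a source set has already been propagated to its successors.

```python
class IncrementalSet:
    """Points-to set split into an already-propagated part and a pending part."""

    def __init__(self):
        self.old = set()   # elements propagated in earlier iterations
        self.new = set()   # elements added since the last flush

    def add_all(self, elements):
        """Add elements not yet present; return True if anything was added."""
        fresh = set(elements) - self.old - self.new
        self.new |= fresh
        return bool(fresh)

    def flush(self):
        """End of an iteration: the new part becomes old."""
        self.old |= self.new
        self.new = set()

    def contents(self):
        return self.old | self.new


def propagate_new(src, dst):
    """Propagate only the new part of src into dst; return True if dst changed."""
    return dst.add_all(src.new)
```

After the first flush, adding one more element to the source means that only that single element is examined on the next propagation, regardless of how large the old parts have become.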

4 Using SPARK for Subset-Based Points-to Analysis 

In order to demonstrate that Spark provides a general and effective means to express 
different points-to analyses, we have done an extensive empirical study of a variety 
of subset-based points-to analyses. By expressing many different variations within the 
same framework we can measure both precision and cost of the analyses. 

4.1 Benchmarks 

We tested Spark on benchmarks from the SPECjvm 0 suite, along with sablecc 
and soot from the Ashes |[T] suite, and jedit 0, a full-featured editor written in 
Java. The last three were selected because they are non-trivial Java applications used in 
the real world, and they were also used in other points-to analysis studies H2()ll27l[T7l . 
The complete list of benchmarks appears in the summary in Table 3 at the end of this
section, along with some characteristics of the benchmarks, and measurements of the 
effectiveness of Spark on them. All benchmarks were analyzed with the Sun JDK 
1.3.1_01 standard class library, on a 1.67 GHz AMD Athlon with 2GB of memory 
running Linux 2.4.18. In addition, we also tested the javac benchmark with the Sun 
JDK 1.1.8 standard class library for comparison with other studies. 

We chose four representative benchmarks for which to present the detailed results 
of our experiments on individual factors affecting precision and efficiency of points- 
to analysis. We chose compress as a small SPECjvm benchmark, javac as a large 
SPECjvm benchmark, and sablecc and jedit as large non-SPECjvm benchmarks 
written by distinct groups of people. We observed similar trends on the other benchmarks. 

Table 2. Analysis precision. 

                                    Dereference Sites (% of total)                   Call Sites (% of total)
                           0        1        2     3-10   11-100 101-1000    1001+       0     1     2    3+
compress  nt-otf-fs     35.2     23.4      6.3     14.1      5.9      0.1     14.9    53.8  42.6   1.6   1.9
          at-otf-fs     35.3     32.7      8.0     17.4      4.3      2.2      0.0    53.8  42.6   1.6   1.9
          ot-otf-fs     36.9     32.1      7.8     17.0      4.3      1.8      0.0    54.6  42.3   1.3   1.8
          ot-cha-fs     20.5     39.6     10.1     21.8      6.0      2.1      0.0    40.8  51.7   2.6   4.9
          ot-otf-fb     26.3     38.1      9.4     19.2      5.1      1.9      0.0    48.0  47.4   2.0   2.6
          ot-cha-fb     16.0     41.6     10.9     22.9      6.4      2.2      0.0    37.5  54.3   2.9   5.2
javac     nt-otf-fs     31.4     22.2      6.0     12.9      5.8      6.4     15.2    50.1  45.3   1.9   2.7
          at-otf-fs     31.6     33.9      8.7     17.7      5.7      2.4      0.0    50.1  45.3   1.9   2.7
          ot-otf-fs     33.0     33.3      8.6     17.3      5.7      2.0      0.0    50.8  45.2   1.5   2.5
          ot-cha-fs     18.4     40.0     10.5     21.5      7.2      2.3      0.0    38.0  53.9   2.6   5.5
          ot-otf-fb     23.6     38.6     10.0     19.2      6.5      2.1      0.0    44.6  49.9   2.1   3.3
          ot-cha-fb     14.5     41.7     11.3     22.5      7.6      2.4      0.0    34.9  56.3   3.0   5.8
sablecc   nt-otf-fs     31.6     24.2      5.9     12.7      9.5        ?     15.8    49.9  45.8   2.1   2.2
          at-otf-fs     31.7     37.9      7.4     16.2      4.9      2.0        ?    49.9  45.8   2.1   2.2
          ot-otf-fs     33.1     37.4      7.3     15.7        ?        ?        ?       ?     ?     ?     ?
          ot-cha-fs        ?        ?        ?        ?        ?        ?        ?       ?     ?     ?     ?
          ot-otf-fb     23.6     42.6      8.7     17.7      5.7      1.7      0.0    44.7  50.3   2.2   2.8
          ot-cha-fb     14.4     45.8     10.0     21.0      6.8      1.9      0.0    34.9  56.6   3.3   5.2
jedit     nt-otf-fs     25.6     29.6      6.6     12.7      3.8      1.5     20.2    43.8  52.0   1.9   2.2
          at-otf-fs     25.7     42.4      9.0     16.3      4.7      2.0      0.0    43.8  52.0   1.9   2.2
          ot-otf-fs     27.1     42.0      8.9     15.9      4.3      1.9      0.0    44.6  51.9   1.4   2.1
          ot-cha-fs     14.5     47.9     10.7     19.4      5.5      2.1      0.0    33.2  59.3   2.3   5.1
          ot-otf-fb     18.9     46.7     10.0     17.6      4.8      2.0      0.0    38.6  56.7   1.9   2.8
          ot-cha-fb     12.1     49.0     11.0     20.1      5.7      2.1      0.0    30.7  61.5   2.5   5.3

4.2 Factors Affecting Precision 

We now discuss three factors that affect not only the efficiency of the analysis, but also 
the precision of its result. These factors are: (1) how types are used in the analysis, (2) 
whether we use a CHA-based call graph or build the call graph on the fly, and (3) whether 
the analysis is field-sensitive or field-based. 

Table 2 gives the results. For each benchmark we experiment with six different 
points-to analyses, where each analysis is named by a triple of the form xx-yyy-zz which 
specifies the setting for each of the three factors (a complete explanation of each factor 
is given in the subsections below). For each benchmark/points-to analysis combination, 
we give a summary of the precision for dereference sites and call sites. 

For dereference sites, we consider all occurrences of references of the form p.f and 
we give the percentage of dereference sites with 0, 1, 2, 3-10, 11-100, 101-1000 and 
more than 1000 elements in their points-to sets. Dereference sites with 0 items in the set 
correspond to statements that cannot be reached (i.e. the CHA call graph conservatively 
indicated that the statement was in a reachable method, but no allocation ever flows to it). 

For call sites, we consider all invokevirtual and invokeinterface calls 
and report the percentage of such call sites with 0, 1, 2, and more than two target 
methods, where the target methods are found using the types of the allocation sites 
pointed to by the receiver of the method call. For example, for a call of the form o.m(), 
the types of allocation sites pointed to by o would be used to find the target methods. 
Calls with 0 targets correspond to unreachable calls and calls with 1 target are guaranteed 
to be monomorphic at run-time. 
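
As an illustration of how the dereference-site columns are formed, the following Python sketch buckets a list of points-to set sizes into the seven size ranges named above. The function name `bucket_percentages` and the sample sizes are hypothetical; only the bucket boundaries come from the text.

```python
def bucket_percentages(set_sizes):
    """Percentage of dereference sites falling into each size range."""
    buckets = [("0", 0, 0), ("1", 1, 1), ("2", 2, 2), ("3-10", 3, 10),
               ("11-100", 11, 100), ("101-1000", 101, 1000),
               ("1001+", 1001, None)]
    counts = {label: 0 for label, _, _ in buckets}
    for size in set_sizes:
        for label, low, high in buckets:
            if size >= low and (high is None or size <= high):
                counts[label] += 1
                break
    total = len(set_sizes)
    if total == 0:
        return {label: 0.0 for label in counts}
    return {label: 100.0 * count / total for label, count in counts.items()}


# Hypothetical points-to set sizes, one per dereference site.
sizes = [0, 1, 1, 2, 5, 50, 500, 5000, 0, 1]
percentages = bucket_percentages(sizes)
# Share of sites whose set has size 0 or 1.
small_sets = percentages["0"] + percentages["1"]
```

The call-site columns can be computed the same way, with the number of target methods in place of the set size and the ranges 0, 1, 2 and more than two.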

Note that since the level of precision required is highly dependent on the application 
of the points-to results, this table is not intended to be an absolute measure of precision; 
rather, we present it only to give some idea of the relative precision of different analysis 
variations, and to give basic insight into the effect that different levels of precision have 
on the analysis. 



Respecting Declared Types: Unlike in C, variables in Java are strongly-typed, limit- 
ing the possible set of objects to which a pointer could point. However, many points-to 
analyses adapted from C do not take advantage of this. For example, the analyses 
described in [?] ignore declared types as the analysis proceeds; however, objects of 
incompatible type are removed after the analysis completes. 

The first three lines of each benchmark in Table 2 show the effect of types. The first 
line shows the precision of an analysis in which declared types are ignored, notypes 
(abbreviated nt). The second line shows the results of the same analysis after objects 
of incompatible type have been removed after completion of the analysis, aftertypes 
(abbreviated at). The third line shows the precision of an analysis in which declared 
types are respected throughout the analysis, on-the-fly types (abbreviated ot). 

We see that removing objects based on declared type after completion of the analysis 
(at) achieves almost the same precision as enforcing the types during the analysis (ot). 
However, notice that during the analysis (nt), between 15% and 20% of the points-to sets 
at dereference sites are over 1000 elements in size. These large sets increase memory 
requirements prohibitively, and slow the analysis considerably. We therefore recommend 
enforcing declared types as the analysis proceeds, which eliminates almost all of these 
large sets. Further, based on this observation, we focus on analyses that respect declared 
types for the remainder of this paper. 
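
The difference between filtering after the analysis (at) and filtering during it (ot) can be illustrated with a small Python sketch. It is a naive solver over assignment edges only, under assumed names (`solve`, `is_subtype`, the class names `A` and `B`); it is not how Spark is implemented.

```python
def is_subtype(sub, sup, parent):
    """True if `sub` equals `sup` or inherits from it (single inheritance)."""
    while sub is not None:
        if sub == sup:
            return True
        sub = parent.get(sub)
    return False


def solve(edges, initial, declared, obj_type, parent, on_the_fly):
    """Return (points-to sets, total set size before any final filtering)."""
    pts = {v: set(initial.get(v, ())) for v in declared}
    changed = True
    while changed:
        changed = False
        for src, dst in edges:
            flowing = pts[src]
            if on_the_fly:
                # ot: objects of incompatible type never enter the set
                flowing = {o for o in flowing
                           if is_subtype(obj_type[o], declared[dst], parent)}
            if not flowing <= pts[dst]:
                pts[dst] |= flowing
                changed = True
    peak = sum(len(s) for s in pts.values())
    if not on_the_fly:
        # at: incompatible objects are removed only once the analysis is done
        pts = {v: {o for o in s if is_subtype(obj_type[o], declared[v], parent)}
               for v, s in pts.items()}
    return pts, peak


# Hypothetical program: x holds an A and a B; y = (A) x; z = y; w = y.
parent = {"A": "Object", "B": "Object"}
declared = {"x": "Object", "y": "A", "z": "A", "w": "Object"}
obj_type = {"a1": "A", "b1": "B"}
edges = [("x", "y"), ("y", "z"), ("y", "w")]
initial = {"x": {"a1", "b1"}}
at, at_peak = solve(edges, initial, declared, obj_type, parent, on_the_fly=False)
ot, ot_peak = solve(edges, initial, declared, obj_type, parent, on_the_fly=True)
```

In this example the B object leaks through y into w when types are applied only at the end, so at is slightly less precise than ot, and the sets held during the analysis are larger.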



Call Graph Construction: As we have already mentioned, the call graph used for an 
inter-procedural points-to analysis can be constructed ahead of time using, for example, 
CHA [?], or on-the-fly as the analysis proceeds [20], for greater precision. We abbreviate 
these variations as cha and otf, respectively. As the third and fourth lines for each 
benchmark in Table 2 show, computing the call graph on-the-fly increases the number 
of points-to sets of size zero (dereference sites determined to be unreachable), but has a 
smaller effect on the size distribution of the remaining sets. 
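
To make the two ways of resolving a virtual call concrete, here is a small Python sketch. The class hierarchy and all function names are hypothetical, and the sketch treats every class as instantiable, which a real CHA implementation would refine.

```python
def resolve(cls, method, parent, defines):
    """Walk up the hierarchy to the class that implements `method`."""
    while cls is not None:
        if method in defines.get(cls, ()):
            return f"{cls}.{method}"
        cls = parent.get(cls)
    return None


def cha_targets(declared, method, parent, defines):
    """CHA: consider the declared receiver type and all of its subclasses."""
    children = {}
    for sub, sup in parent.items():
        children.setdefault(sup, []).append(sub)
    targets, stack = set(), [declared]
    while stack:
        cls = stack.pop()
        target = resolve(cls, method, parent, defines)
        if target is not None:
            targets.add(target)
        stack.extend(children.get(cls, ()))
    return targets


def points_to_targets(receiver_objects, obj_type, method, parent, defines):
    """On-the-fly: consider only the types of objects the receiver may point to."""
    targets = set()
    for obj in receiver_objects:
        target = resolve(obj_type[obj], method, parent, defines)
        if target is not None:
            targets.add(target)
    return targets


# Hypothetical hierarchy: Circle and Square extend Shape, all define area().
parent = {"Circle": "Shape", "Square": "Shape"}
defines = {"Shape": {"area"}, "Circle": {"area"}, "Square": {"area"}}
cha = cha_targets("Shape", "area", parent, defines)
otf = points_to_targets({"c1"}, {"c1": "Circle"}, "area", parent, defines)
unreachable = points_to_targets(set(), {}, "area", parent, defines)
```

Here CHA reports three possible targets, while the points-to based resolution finds the call monomorphic; an empty receiver set yields no targets, which corresponds to a call site counted as unreachable.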



Field Dereference Expressions: We study the handling of field dereference expressions 
in a field-based (abbreviated fb) and field-sensitive (fs) manner. Comparing rows 3 and 
5 (on-the-fly call graph), and rows 4 and 6 (CHA call graph), for each benchmark, we 
see that field-sensitive analysis is more precise than the field-based analysis. Thus, it is 
probably worthwhile to do field-sensitive analysis if the cost of the analysis is reasonable. 
As we will see later, in Table 4, with the appropriate solver, the field-sensitive analysis 
can be made to be quite competitive with the field-based analysis. 
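
The precision difference can be seen on a three-statement example. The Python sketch below is a naive solver written for illustration only; the tuple encodings of stores and loads and the name `analyze` are assumptions. In the field-sensitive case the heap is keyed by (object, field), in the field-based case by the field alone.

```python
def analyze(allocs, stores, loads, field_sensitive):
    """allocs: var -> objects; stores: (base, field, src); loads: (dst, base, field)."""
    pts = {var: set(objs) for var, objs in allocs.items()}
    heap = {}  # (object, field) -> objects, or field -> objects

    def keys(base, field):
        if field_sensitive:
            return [(obj, field) for obj in pts.get(base, ())]
        return [field]

    changed = True
    while changed:
        changed = False
        for base, field, src in stores:            # base.field = src
            for key in keys(base, field):
                cell = heap.setdefault(key, set())
                added = pts.get(src, set()) - cell
                if added:
                    cell |= added
                    changed = True
        for dst, base, field in loads:             # dst = base.field
            target = pts.setdefault(dst, set())
            for key in keys(base, field):
                added = heap.get(key, set()) - target
                if added:
                    target |= added
                    changed = True
    return pts


# Hypothetical program: a.f = x; b.f = y; r = a.f
allocs = {"a": {"oa"}, "b": {"ob"}, "x": {"ox"}, "y": {"oy"}}
stores = [("a", "f", "x"), ("b", "f", "y")]
loads = [("r", "a", "f")]
fs = analyze(allocs, stores, loads, field_sensitive=True)
fb = analyze(allocs, stores, loads, field_sensitive=False)
```

The field-sensitive run finds that r can only point to the object stored through a, while the field-based run merges everything ever stored into any field named f.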



4.3 Factors Affecting Performance 

Set Implementation: We evaluated the analyses with the four different implementations 
of points-to sets described in Section 3. Table 3 shows the efficiency of the implemen- 
tations using two of the propagation algorithms: the naive, iterative algorithm, and the 
incremental worklist algorithm. For both algorithms, we respected declared types dur- 
ing the analysis, used a CHA call graph, and simplified the pointer assignment graph 
by collapsing cycles and variables with common predecessors as described in [?]. The 
“Graph space” column shows the space needed to store the original pointer assignment 
graph, and the remaining space columns show the space needed to store the points-to 
sets. The data structure storing the graph is designed for flexibility rather than space 
efficiency; it could be made smaller if necessary. In any case, its size is linear in the size 
of the program being analyzed. 



Table 3. Set Implementation (time in seconds, space in MB). 

                                   Graph     Hash           Array           Bit          Hybrid
Benchmark  Algorithm               space    time  space    time  space    time  space    time  space
compress   Iterative                  31    3448    311    1206    118      36     75      24     34
           Incremental Worklist       31     219    319      62     57      14    155       9     53
javac      Iterative                  34    3791    361    1114    139      50     88      33     41
           Incremental Worklist       34     252    369      61     68      19    181      13     65
sablecc    Iterative                  36    4158    334    1194    132      50     93      32     42
           Incremental Worklist       36     244    342      54     62      17    193      11     66
jedit      Iterative                  42    6502    583    2233    229      91    168      59     77
           Incremental Worklist       42     488    597     135    114      38    349      24    128


The terrible performance of the hash set implementation is disappointing, as this is 
the implementation provided by the standard Java library. Clearly, anyone serious about 
implementing an efficient points-to analysis in Java must write a custom set representa- 
tion. 

The sorted array set implementation is prohibitively expensive using the iterative 
algorithm, but becomes reasonable using the incremental worklist algorithm, which is 
designed explicitly to limit the size of the sets that must be propagated. 

The bit set implementation is much faster still than the sorted array set implementa- 
tion. However, especially when used with the incremental worklist algorithm, its memory 
usage is high, because the many small sets are represented using the same size bit-vector 
as large sets. In addition, the incremental worklist algorithm splits each points-to set into 
two halves, making the bit set implementation use twice the memory. 

Finally, the hybrid set implementation is even faster than the bit set implementation, 
while maintaining modest memory requirements. We have found the hybrid set imple- 
mentation to be consistently the most efficient over a wide variety of settings of the other 
parameters, and therefore recommend that it always be used. 
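
The trade-off between compact small sets and fast large sets can be sketched in Python. This sketch assumes that a hybrid set stores a small set as an explicit list and switches to a bit vector once it grows; the class name `HybridSet`, its methods and the default threshold of 16 are assumptions made for illustration, not details taken from this section.

```python
class HybridSet:
    """Small sets as an explicit list, large sets as a bit vector (assumed design)."""

    def __init__(self, universe_size, threshold=16):
        self.universe_size = universe_size
        self.threshold = threshold   # assumed cut-over point
        self.small = []              # explicit elements while the set is small
        self.bits = None             # Python int used as a bit vector afterwards

    def add(self, element):
        """Add an element; return True if it was not already present."""
        if not 0 <= element < self.universe_size:
            raise ValueError("element outside the universe")
        if self.bits is None:
            if element in self.small:
                return False
            if len(self.small) < self.threshold:
                self.small.append(element)
                return True
            # The list is full: convert it to a bit vector.
            self.bits = 0
            for e in self.small:
                self.bits |= 1 << e
            self.small = []
        mask = 1 << element
        if self.bits & mask:
            return False
        self.bits |= mask
        return True

    def __contains__(self, element):
        if self.bits is None:
            return element in self.small
        return bool((self.bits >> element) & 1)

    def __len__(self):
        if self.bits is None:
            return len(self.small)
        return bin(self.bits).count("1")


# Hypothetical usage with a deliberately tiny threshold.
s = HybridSet(universe_size=64, threshold=4)
results = [s.add(e) for e in (3, 9, 27, 3, 40)]
still_small = s.bits is None
s.add(41)   # fifth distinct element forces the switch to a bit vector
```

With such a design the many small sets avoid the fixed cost of a full-size bit vector, which is the memory problem noted for the pure bit set implementation.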



Points-to Set Propagation Algorithms: Table 4 shows the time and space requirements 
of the propagation algorithms included in Spark. All measurements in this table were 
made using the hybrid set implementation, and without any simplification of the pointer 
assignment graph.* Again, the “Graph space” column shows the space needed to store 
the original pointer assignment graph, and the remaining space columns show the space 
needed to store the points-to sets. 



Table 4. Propagation Algorithms (time in seconds, space in MB). 

                        Graph   Iterative      Worklist     Incr. Worklist
Benchmark  Analysis     space    time  space    time  space    time  space
compress   nt-otf-fs       32    1628    357     992    365     399    605
           ot-otf-fs       37     133     52      58     51      52     69
           ot-cha-fs       36      49     68      15     63      13     91
           ot-otf-fb       35     158     54      86     52      66     66
           ot-cha-fb       34      17     62      10     56      13     76
javac      nt-otf-fs       34    2316    502    1570    512     715    856
           ot-otf-fs       40     201     69     103     66      90     90
           ot-cha-fs       39      64     83      22     77      18    109
           ot-otf-fb       37     218     70     123     66     102     84
           ot-cha-fb       37      22     75      11     67      15     90
sablecc    nt-otf-fs       35    2190    462    1382    472     635    772
           ot-otf-fs       41     274     72     104     70      95     94
           ot-cha-fs       41      66     88      20     83      18    117
           ot-otf-fb       38     255     74     138     72     114     90
           ot-cha-fb       38      52     81      14     74      18     97
jedit      nt-otf-fs      oom     oom    oom     oom    oom     oom    oom
           ot-otf-fs       49     313    121     142    117     101    169
           ot-cha-fs       48     107    141      59    131      38    196
           ot-otf-fb       47     298    104     178     99     111    126
           ot-cha-fb       45      28    109      21     98      27    128


The nt-otf-fs line shows how much ignoring declared types hurts efficiency (the 
“oom” for jedit signifies that the analysis exceeded the 1700MB of memory 
allotted); we recommend that declared types be respected. Results from the recommended 
algorithms are in bold. 

* The time and space reported for the hybrid set implementation in Table 3 are different than in 
Table 4 because the former were measured with off-line pointer assignment graph simplification, 
and the latter without. 

The iterative algorithm is consistently slowest, and is given as a baseline only. The 
worklist algorithm is usually about twice as fast as the iterative algorithm. For the 
CHA field-based analysis, this algorithm is consistently the fastest, faster even than the 
incremental worklist algorithm. This is because the incremental worklist algorithm is 
designed to propagate only the newly-added part of the points-to sets in each iteration, 
but the CHA field-based analysis requires only a single iteration. Therefore, any benefit 
from its being incremental is outweighed by the overhead of maintaining two parts of 
every set. However, both field-sensitivity and on-the-fly call graph construction require 
iteration, so for these, the incremental worklist algorithm is consistently fastest. We note 
that the speedup comes with a cost in the memory required to maintain two parts of 
every set. 
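
The gap between the iterative baseline and a worklist solver can be illustrated by counting how often each one examines an edge. The Python sketch below covers assignment edges only; the function names and the example graph are hypothetical, and edge visits are used as a rough stand-in for running time.

```python
from collections import deque


def _initial_sets(edges, initial):
    nodes = sorted({n for edge in edges for n in edge} | set(initial))
    return {n: set(initial.get(n, ())) for n in nodes}


def solve_iterative(edges, initial):
    """Re-examine every edge until nothing changes."""
    pts = _initial_sets(edges, initial)
    visits = 0
    changed = True
    while changed:
        changed = False
        for src, dst in edges:
            visits += 1
            if not pts[src] <= pts[dst]:
                pts[dst] |= pts[src]
                changed = True
    return pts, visits


def solve_worklist(edges, initial):
    """Examine only the edges leaving nodes whose sets have changed."""
    pts = _initial_sets(edges, initial)
    succs = {}
    for src, dst in edges:
        succs.setdefault(src, []).append(dst)
    worklist = deque(n for n in pts if pts[n])
    visits = 0
    while worklist:
        src = worklist.popleft()
        for dst in succs.get(src, ()):
            visits += 1
            if not pts[src] <= pts[dst]:
                pts[dst] |= pts[src]
                if dst not in worklist:
                    worklist.append(dst)
    return pts, visits


# A chain a -> b -> c -> d, with edges listed in an unfavourable order.
edges = [("c", "d"), ("b", "c"), ("a", "b")]
iter_pts, iter_visits = solve_iterative(edges, {"a": {"o"}})
work_pts, work_visits = solve_worklist(edges, {"a": {"o"}})
```

Both solvers reach the same fixed point. Note that the worklist version still copies the whole set of a node each time it is processed, which is the cost the incremental variant avoids by propagating only the newly added part.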

Note also that while the field-based analysis is faster than field-sensitive with a 
CHA call graph, it is slower when the call graph is constructed on the fly (with all 
propagation algorithms). This is because although a field-based analysis with a CHA call 
graph completes in one iteration, constructing the call graph on-the-fly requires iterating 
regardless of the field representation. The less precise field-based representation causes 
more methods to be found reachable, increasing the number of iterations required. 



Graph Simplification: Rountev and Chandra [?] showed that simplifying the pointer 
assignment graph by merging nodes known to have equal points-to sets speeds up the 
analysis. Our experience agrees with their findings. 

When respecting declared types, a cycle can only be merged if all nodes in the cycle 
have the same declared type, and a subgraph with a unique predecessor can only be 
merged if all its nodes have declared types that are supertypes of the predecessor. On 
our benchmarks, between 6% and 7% of variable nodes were removed by collapsing 
cycles, compared to between 5% and 6% when declared types were respected. Between 
59% and 62% of variable nodes were removed by collapsing subgraphs with a unique 
predecessor, compared to between 55% and 58% when declared types were respected. 
Thus, the effect of respecting declared types on simplification is minor. 
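
The type restriction on cycle merging can be sketched in Python as follows. Strongly connected components are found with Tarjan's algorithm, and a component is merged only if all its nodes share one declared type. The function names and the example graph are hypothetical, and rejecting a whole mixed-type component is a conservative simplification of the rule stated above.

```python
def strongly_connected_components(nodes, succs):
    """Tarjan's algorithm (recursive; adequate for small graphs)."""
    index, low, on_stack, stack, result = {}, {}, set(), [], []
    counter = [0]

    def visit(v):
        index[v] = low[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in succs.get(v, ()):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:
            component = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.add(w)
                if w == v:
                    break
            result.append(component)

    for v in nodes:
        if v not in index:
            visit(v)
    return result


def mergeable_cycles(nodes, succs, declared=None):
    """Components of size > 1; with `declared`, only those of a single type."""
    groups = []
    for component in strongly_connected_components(nodes, succs):
        if len(component) < 2:
            continue
        if declared is not None and len({declared[v] for v in component}) != 1:
            continue
        groups.append(component)
    return groups


# Two cycles: a <-> b (both of type T) and c <-> d (types T and U).
nodes = ["a", "b", "c", "d"]
succs = {"a": ["b"], "b": ["a", "c"], "c": ["d"], "d": ["c"]}
declared = {"a": "T", "b": "T", "c": "T", "d": "U"}
untyped = sorted(sorted(g) for g in mergeable_cycles(nodes, succs))
typed = sorted(sorted(g) for g in mergeable_cycles(nodes, succs, declared))
```

In the example, ignoring types allows both cycles to be collapsed, while respecting types keeps c and d as separate nodes.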

On the other hand, when constructing the call graph on-the-fly, no inter-procedural 
edges are present before the analysis begins. This means that any cycles spanning 
multiple methods are broken, and the corresponding nodes cannot be merged. The 6%-7% 
of nodes removed by collapsing cycles dropped to 1%-1.5% when the call graph was 
constructed on-the-fly. The 59%-62% of nodes removed by collapsing subgraphs with 
a unique predecessor dropped to 31%-33%. When constructing the call graph 
on-the-fly, simplifying the pointer assignment graph before the analysis has little effect, 
and on-the-fly cycle detection methods should be used instead. 

4.4 Overall Results 

Based on our experiments, we have selected three analyses that we recommend as good 
compromises between precision and speed, with reasonable space requirements: 

ot-otf-fs is suitable for applications requiring the highest precision. For this analysis, 
the incremental worklist algorithm works best. 

ot-cha-fs is much faster, but with a drop in precision as compared to ot-otf-fs (mostly 
because it includes significantly more call edges). For this analysis, the incremental 
worklist algorithm works best. 

ot-cha-fb is the fastest analysis, completing in a single iteration, but it is also the least 
precise. For this analysis, the non-incremental worklist algorithm works best. 

Each of the three analyses should be implemented using the hybrid sets. 



Table 5. Overall Results (time in seconds, space in MB, precision in percent). 

                methods    stmts            ot-otf-fs          ot-cha-fs          ot-cha-fb
Benchmark         (CHA)    (CHA)  types   time space prec.   time space prec.   time space prec.
compress          15183   278902   2770     52   106  69.1     13   127  60.1     10    90  57.6
db                15185   278954   2763     52   107  68.9     14   128  59.9     11    90  57.4
jack              15441   288142   2816     54   112  68.7     14   132  60.1     11    94  57.6
javac (1.1.8)      4602    86454    874      8    27  63.6      3    24  57.4      1    16  55.1
javac             16307   301801   2940     89   131  66.3     18   148  58.4     11   104  56.2
jess              15794   288831   2917     57   115  68.1     15   136  59.2     10    97  56.8
mpegaudio         15385   283482   2782     56   112  68.6     16   134  59.7     11    93  57.4
raytrace          15312   281587   2789     53   107  68.5     13   129  59.6     11    91  57.1
sablecc           16977   300504   3070     95   136  70.5     18   158  62.5     14   112  60.3
soot              17498   310935   3435     88   143  68.3     19   162  60.4     18   116  58.4
jedit             19621   367317   3395    100   218  69.1     38   244  62.3     21   143  61.1


Table 5 shows the results of these three analyses on our full set of benchmarks. The 
first column gives the benchmark name (javac is listed twice: once with the 1.1.8 
JDK class library, and once with the 1.3.1_01 JDK class library). The next two columns 
give the number of methods determined to be reachable, and the number of Jimple* 
statements in these methods. Note that because of the large class library, these are the 
largest Java benchmarks for which a subset-based points-to analysis has been reported, 
to our knowledge. The fourth column gives the number of distinct types encountered 
by the subtype tester. The remaining columns give the analysis time, total space, and 
precision for each of the three recommended analyses. The total space includes the 
space used to store the pointer assignment graph as well as the points-to sets; these were 
reported separately in previous tables. The precision is measured as the percentage of 
field dereference sites at which the points-to set of the pointer being dereferenced has 
size 0 or 1; for a more detailed measurement of precision, see Table 2. 

5 Related Work 

The most closely related work consists of various adaptations of points-to analyses for C to Java. 

Rountev, Milanova and Ryder [20] based their field-sensitive analysis for Java on 
Soot [26] and the BANE [?] constraint solving toolkit, on which further points-to analysis 
work has been done [?]. Their analysis was field-sensitive, constructed the call graph 
on-the-fly, and ignored declared types until after the analysis completed. They reported 
empirical results on many benchmarks using the JDK 1.1.8 standard class library. Since 
they do not handle declared types during the analysis, their implementation suffers 
from having to represent large points-to sets, and is unlikely to scale well to large class 
libraries. They do not report results for the JDK 1.3.1 library, but their results for javac 
(1.1.8) show 350 seconds and 125.5 MB of memory (360 MHz Sun Ultra-60 machine 
with 512 MB of memory, BANE solver written in ML), compared to 8 seconds and 27 
MB of memory (1.67 GHz AMD Athlon with 2GB memory, solver written in Java) for 
the ot-otf-fs analysis using Spark. The precision of our results should be very slightly 
better, since the Rountev et al. method is equivalent to our at-otf-fs analysis, which we 
showed to be slightly less precise than the ot-otf-fs analysis. 

* Jimple is the three-address typed intermediate representation used by Soot. 

Whaley and Lam’s [?] approach is interesting in that it adapts the demand-driven 
algorithm of Heintze and Tardieu [15,14] (see below) to Java. The intermediate 
representation on which their analysis operates is different from Jimple (on which our and 
Rountev, Milanova and Ryder’s analyses are based) in that it does not split stack 
locations based on DU-UD webs; instead, it uses intra-method flow-sensitivity to achieve 
a similar effect. In contrast with other work that used a conservative (safe) 
approximation of the reachable methods to analyze, Whaley and Lam’s experiments used 
optimistic assumptions (not safe) about which methods need to be analyzed. In 
particular, the results presented in their paper are for a variation of the analysis that does 
not analyze class initializers and assumes that all native methods have no effect on the 
points-to analysis. Their optimistic assumptions about which methods are reachable lead 
to reachable method counts almost an order of magnitude lower than reported in other 
related work, such as the present paper, and [20,24]; in fact, they analyze significantly 
fewer methods than can be observed to be executed at run-time in a standard run of the 
benchmarks. As a result of the artificially small number of methods that they analyze, 
they get fast execution times. Even so, when looking at the jedit benchmark, the only 
benchmark for which they analyze at least half of the number of methods analyzed by 
Spark, their analysis runs in 614 seconds and 472 MB of memory (2 GHz Pentium 4, 
2GB of memory, solver written in Java), compared to 100 seconds and 218 MB for the 
most precise analysis in Spark (1.67 GHz AMD Athlon, 2GB memory, solver written 
in Java). 

Our comparison with these two previous works on points-to analysis for Java 
illustrates two important things. First, it would be preferable to compare the analyses 
head to head, on the same system, with the same assumptions about what code needs to 
be analyzed. Second, it appears that Spark allows one to develop efficient analyses that 
compare very favourably with previous work. 

Liang, Pennings and Harrold [?] tested several variations of Java points-to 
analyses, including subset-based and equality-based variations, field-based and field-sensitive 
variations, and constructing the call graph using CHA [?] and RTA [7]. Instead of 
analyzing benchmarks with the standard class library, they hand-coded a model of the most 
commonly used JDK 1.1.8 standard classes. Thus, we cannot make direct comparisons, 
since our results include all the library code. 

Heintze and Tardieu [?] reported very fast analysis times using their analysis 
for C. The main factor making it fast was a demand-driven algorithm that also collapsed 
cycles in the constraint graph on-the-fly. Such a demand-driven algorithm is particularly 
useful when the points-to sets of only a subset of pointer variables are required; we plan 
to implement it in a future version of Spark for such applications. In addition, in an 
unpublished report [?], Heintze discusses an implementation of sets using bit-vectors 
which are shared, so that copies of an identical set are only stored once. We are also 
considering implementing this set representation in Spark. 

Since points-to analysis in general is a very active area of research, we can only list 
the work most closely related to ours. A more complete survey appears in [?]. 

6 Conclusions and Future Work 

We have presented Spark, a flexible framework for experimenting with points-to analysis 
for Java. Our empirical results have shown that Spark is not only flexible, but also 
competitive with points-to analyses that have been implemented in other frameworks. 
Using Spark, we studied various factors affecting the precision and efficiency of points- 
to analysis. Our study led us to recommend three specific analyses, and we showed that 
they compare favourably to other analyses that have been described in the literature. We 
plan several improvements to Spark. First, we would like to create an on-the-fly pointer 
assignment graph builder, so that the entire pointer assignment graph need not be built 
for an on-the-fly call graph analysis. Second, we would like to add Heintze and Tardieu’s 
demand-driven propagation algorithm to Spark. 

We have several studies in mind that we would like to perform using Spark. First, we 
are implementing points-to analysis using Reduced Ordered Binary Decision Diagrams 
to store the large, often duplicated sets. Second, we plan to study the effects of various 
levels of context-sensitivity on Java points-to analysis. Third, we will experiment with 
various clients of the points-to analysis. 


Abstract. Inline-threaded interpretation is a recent technique that im- 
proves performance by eliminating dispatch overhead within basic blocks 
for interpreters written in C [?]. The dynamic class loading, lazy class 
initialization, and multi-threading features of Java reduce the effective- 
ness of a straight-forward implementation of this technique within Java 
interpreters. In this paper, we introduce preparation sequences, a new 
technique that solves the particular challenge of effectively inline-threa- 
ding Java. We have implemented our technique in the SableVM Java 
virtual machine, and our experimental results show that using our tech- 
nique, inline-threaded interpretation of Java, on a set of benchmarks, 
achieves a speedup ranging from 1.20 to 2.41 over switch-based inter- 
pretation, and a speedup ranging from 1.15 to 2.14 over direct-threaded 
interpretation. 



1 Introduction 

One of the main advantages of interpreters written in high-level languages is their 
simplicity and portability, when compared to static and dynamic compiler-based 
systems. One of their main drawbacks is poor performance, due to a high cost for 
dispatching interpreted instructions. In [?], Piumarta and Riccardi introduced 
a technique called inlined-threading which reduces this overhead by dynamically 
inlining instruction sequences within basic blocks, leaving a single instruction 
dispatch at the end of each sequence. To our knowledge, inlined-threading has 
not been applied to Java interpreters before. Applying this inline-threaded 
technique within an interpreter-based Java virtual machine (JVM) is unfortunately 
difficult, as Java has features that conflict with a straight-forward 
implementation of the technique. In particular, the JVM specification [?] mandates lazy 
class initialization, permits lazy class loading and linking, and mandates support 
for multi-threading. Efficient implementation of laziness requires in-place code 
replacement which is a delicate operation to do within a multi-threaded 
environment. In this paper, we introduce a technique called preparation sequences 
which solves the synchronization and shorter sequence problems caused by 
in-place code replacement within an inline-threaded interpreter-based JVM. 

This paper is structured as follows. In Section 2 we briefly describe some 
interpreter instruction dispatch techniques, including inline-threading. Then, in 
Section 3 we discuss the difficulty of applying the inline-threaded technique in 
a Java interpreter. Next, in Section 4 we introduce our preparation sequences 
technique. In Section 5 we present our experimental results within the SableVM 
framework. In Section 6 we discuss related work. Finally, in Section 7 we present 
our conclusions. 

2 Dispatch Types 

In this section, we describe three dispatch mechanisms generally used for imple- 
menting interpreters. 



Switching. A typical bytecode interpreter loads a bytecode program from disk 
using standard file operations, and stores instructions into an array. It then dis- 
patches instructions using a simple loop-embedded switch statement, as shown 
in Figure 1(a). This approach has performance drawbacks. Dispatching instruc- 
tions is very expensive. A typical compilation of the dispatch loop requires a 
minimum of 3 control transfer machine instructions per iteration: one to jump 
from the previous bytecode implementation to the head of the loop, one to test 
whether the bytecode is within the bounds of handled switch-case values, and 
one to transfer control to the selected case statement. On modern processors, 
control transfer is one of the main obstacles to performance [7], so this dispatch 
mechanism causes significant overhead. 



Direct-Threading. This technique was popularized by the Forth programming 
language [?]. Direct-threading improves on switch-based dispatch by eliminating 
central dispatch. In the executable code stream, each bytecode is replaced by 
the address of its associated implementation. This reduces, to one, the number 
of control transfer instructions per dispatch. Direct-threading is illustrated in 
Figure 1(b)^1. 



Inline-Threading. This technique, recently introduced in [?], improves upon 
direct-threading by eliminating dispatch overhead for instructions within a basic 
block [1]. The general idea is to identify instruction sequences forming basic 
blocks, within the code array, then to dynamically create a new implementation 
for the whole sequence by sequentially copying the body of each implementation 
into a new buffer, then copying the dispatch code at the end. Finally a pointer to 
this sequence implementation is stored into the code array, replacing the original 
bytecode of the first instruction in the sequence. Figure 2 illustrates the creation 
of an instruction sequence implementation and shows an abstract source code 
representation of the resulting inlined instruction sequence implementation. 

^1 Figure 1(b) uses the label-as-value GNU C extension, but direct-threading can also 
be implemented using a couple of macros containing inline assembly. 

(a) Pure Switch-Based Interpreter 

    char code[CODESIZE]; 
    char *pc = code; 
    int stack[STACKSIZE]; 
    int *sp = stack; 

    /* load bytecodes from file and 
       store them in code[] */ 

    /* dispatch loop */ 
    while (true) { 
      switch (*pc++) { 
        case ICONST_1: *sp++ = 1; break; 
        case ICONST_2: *sp++ = 2; break; 
        case IADD: --sp; sp[-1] += *sp; break; 
        case END: exit(0); 
    }} 

(b) Direct-Threaded Interpreter 

    /* code */ 
    void *code[] = { 
      &&ICONST_2, &&ICONST_2, 
      &&ICONST_1, &&IADD, ... 
    }; 
    void **pc = code; 

    /* dispatch first instruction */ 
    goto **(pc++); 

    /* implementations */ 
    ICONST_1: *sp++ = 1; goto **(pc++); 
    ICONST_2: *sp++ = 2; goto **(pc++); 
    IADD: --sp; sp[-1] += *sp; 
          goto **(pc++); 

Fig. 1. Switch and Direct-Threaded Interpreters 



Inline-threading improves performance by reducing the overhead due to dispatch.
This is particularly effective for sequences of simple instructions, which
have a high dispatch-to-real-work ratio. Unfortunately, not all instructions can
be inlined. C function calls, hidden (compiler-generated) function calls, or even
simple conditional expressions (in the presence of some compiler optimizations)
can prevent inlining of an instruction^.

3 The Difficulty of Inline-Threading Java 

Lazy Loading and Preparation. In Java, classes are dynamically loaded. 
The JVM Specification allows a virtual machine to eagerly or lazily load 
classes (or anything in between). But this flexibility does not extend to class
initialization^. Class initialization must occur at specific execution points, such
as the first invocation of a static method or the first access to a static field of 
a class. Lazily loading classes has many advantages: it saves memory, reduces 
network traffic, and reduces startup overhead. 

Inline-threading requires analyzing a bytecode array to determine basic blo- 
cks, allocating and preparing implementation sequences, and lastly preparing 
a code array. As this preparation is time and space consuming, it is advisable 
to only prepare methods that will actually be executed. This can be achieved 
through lazy method preparation. 



Performance Issue. Lazy preparation (and loading), which aims at improving
performance, can pose a performance problem within a multi-threaded^ environment.
The problem is that, in order to prevent corruption of the internal data
structures of the virtual machine, concurrent preparation of the same method (or
class) on distinct Java threads should not be allowed.

^ The target of a relative branch instruction might be invalid in the inlined instruction
copy.

^ Class initialization consists of initializing static fields and executing static class
initializers.

^ Note that multi-threading is a concurrent programming technique which is inherently
supported in Java, whereas inline-threading is an instruction dispatch technique.

(a) Instruction Implementations

    ICONST_1_START: *sp++ = 1;
    ICONST_1_END:   goto **(pc++);

    INEG_START: sp[-1] = -sp[-1];
    INEG_END:   goto **(pc++);

    DISPATCH_START: goto **(pc++);
    DISPATCH_END:   ;

(b) Sequence Computation

    /* Implement the sequence ICONST_1 INEG */
    size_t iconst_size = (&&ICONST_1_END - &&ICONST_1_START);
    size_t ineg_size = (&&INEG_END - &&INEG_START);
    size_t dispatch_size = (&&DISPATCH_END - &&DISPATCH_START);

    void *buf = malloc(iconst_size + ineg_size + dispatch_size);
    void *current = buf;

    memcpy(current, &&ICONST_1_START, iconst_size); current += iconst_size;
    memcpy(current, &&INEG_START, ineg_size); current += ineg_size;
    memcpy(current, &&DISPATCH_START, dispatch_size);

    /* Now, it is possible to execute the sequence using: */
    goto *buf;

(c) Inlined Instruction Sequence

    ICONST_1 body:  *sp++ = 1;
    INEG body:      sp[-1] = -sp[-1];
    DISPATCH body:  goto **(pc++);

Fig. 2. Inlining a Sequence

The natural approach for preventing concurrent preparation is to use
synchronization primitives such as pthread mutexes^. But this approach can have a
very high performance penalty; in a naive implementation, it adds synchronization
overhead to every method call throughout a program's execution, which is
clearly unacceptable, especially for multi-threaded Java applications.



Broken Sequences. An important performance factor of inline-threading is the 
length of inlined instruction sequences. Longer sequences reduce the dispatch- 
to-real work ratio and lead to improved performance. Lazy class initialization 
mandates that the first call to a static method (or access to a static field) must 
cause initialization of a class. This implies (in a naive Java virtual machine im- 
plementation) that instructions such as GETSTATIC must use a conditional to test 
whether the target class must be initialized prior to performing the static field 
access. If initialization is required, a call to the initialization function must be 
made. The conditional and the C function call prevent inlining of the GETSTATIC 
instruction. 

What we would like is to use two versions of the GETSTATIC instruction, as
shown in Figure 3, and replace the slow synchronized version by the fast version
after initialization. Unfortunately this does not completely solve our performance 
problem. Even though this technique eliminates synchronization overhead from 
most executions of the GETSTATIC instruction, it inhibits the removal of dispatch 
code in an instruction which has very little real work to do. In fact, the cost can 

be as high as the execution of two additional dispatches. To measure this, we
compare the cost of two inline-threaded instruction sequences that only differ in
their respective use of ILOAD and GETSTATIC in the middle of the sequence.

^ POSIX Threads mutual exclusion locks.



Synchronized GETSTATIC

    GETSTATIC_INIT: /* pseudo-code */
      pthread_mutex_lock(...);

      /* lazily load class */

      /* conditional & function call */
      if (must_initialize)
        initialize_class(...);

      /* do the real work */
      *sp++ = class.static_field;

      /* replace by fast version */
      code[pc - 1] = &&GETSTATIC_NO_INIT;

      pthread_mutex_unlock(...);

      /* dispatch */
      goto **(pc++);

Unsynchronized GETSTATIC

    GETSTATIC_NO_INIT: /* pseudo-code */
      /* do the real work */
      *sp++ = class.static_field;

      /* dispatch */
      goto **(pc++);

Fig. 3. GETSTATIC With and Without Initialization 



Broken Sequence Cost. If we had the sequence of instructions
ICONST2-ILOAD-IADD, we could build a single inlined sequence for these three
instructions, adding a single dispatch at the end of this sequence. Cost:
3 x realwork + 1 x dispatch. If, instead, we had the sequence of instructions
ICONST2-GETSTATIC-IADD, we would not be allowed to create a single inlined
sequence for the three instructions. This is because, in the prepared code array,
we would need to put 3 distinct instructions: ICONST2, GETSTATIC_INIT, and IADD,
where the middle instruction cannot be inlined. Even though the GETSTATIC_INIT
will eventually be replaced by the more efficient GETSTATIC_NO_INIT, the
performance cost, after replacement, will remain: 3 x realwork + 3 x dispatch.
So, the overhead of a broken sequence can be as high as two additional dispatches.
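
The dispatch counts above can be reproduced with a small counting sketch. It is not taken from the paper's implementation: the `insn_t` type, its `inlinable` flag and the `count_dispatches` function are hypothetical names, and the model simply assumes that each maximal run of inlinable instructions shares one dispatch while every non-inlinable instruction pays its own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical instruction descriptor: only the property needed here. */
typedef struct {
    const char *name;
    bool inlinable;
} insn_t;

/* Count the dispatches executed for one pass over a straight-line
 * sequence: one per maximal run of inlinable instructions, plus one
 * per non-inlinable instruction. */
size_t count_dispatches(const insn_t *seq, size_t n)
{
    assert(seq != NULL || n == 0);
    size_t dispatches = 0;
    bool in_run = false;              /* inside an inlined run? */
    for (size_t i = 0; i < n; i++) {
        if (seq[i].inlinable) {
            if (!in_run) {            /* a new inlined sequence starts */
                dispatches++;
                in_run = true;
            }
        } else {                      /* breaks the run, own dispatch */
            dispatches++;
            in_run = false;
        }
    }
    return dispatches;
}
```

Under this model, a fully inlinable three-instruction sequence costs one dispatch, while the same sequence with a non-inlinable instruction in the middle costs three, matching the two costs given in the text.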



Two-Values Replacement. In reality, the problem is even a little deeper. The
pseudo-code of Figure 3 hides the fact that GETSTATIC_INIT needs to replace two
values in the code array: the instruction opcode and its operand. The idea is
that we want the address of the static variable as an operand (not an indirect
pointer) to achieve maximum efficiency, as shown in Figure 4. But this pointer is
unavailable at the time of preparation of the code array, as lazy class loading only
takes place later, within the implementation of the GETSTATIC_INIT instruction.

Replacing two values without synchronization creates a race condition. Here 
is a short illustration of the problem. A first Java thread reads both initial values, 
does the instruction work, then replaces the first of the two values. At this exact 
point of time (before the second value is replaced), a second Java thread reads 
the two values (instruction and operand) from memory. The second Java thread 






Fast Instruction

    GETSTATIC_NO_INIT:
    { int *pvalue = (pc++)->pvalue;
      *sp++ = *pvalue;
    }
    /* dispatch */
    goto **(pc++);

Code Array

    /* Initially */
    [GETSTATIC_INIT]
    [POINTER_TO_FIELD_INFO]

    /* After first execution */
    [GETSTATIC_NO_INIT]
    [POINTER_TO_FIELD]

Fig. 4. Two-Values Replacement in Code Array



will thus get the fast instruction opcode and the old field info pointer. This can 
of course lead to random execution problems. 
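
The interleaving described above can be replayed deterministically. The following is a single-threaded sketch, not a real concurrent test: the `slot_pair_t` type and the `writer_step`, `consistent` and `read_after` helpers are hypothetical names, and the operand values are symbolic tags rather than real pointers.

```c
#include <assert.h>
#include <stdbool.h>

enum { GETSTATIC_INIT, GETSTATIC_NO_INIT };   /* opcodes */
enum { PTR_TO_FIELD_INFO, PTR_TO_FIELD };     /* operand kinds (symbolic) */

/* The two adjacent code-array words: opcode and operand. */
typedef struct { int opcode; int operand; } slot_pair_t;

/* The first thread performs step k (0, then 1) of the two-values replacement. */
void writer_step(slot_pair_t *code, int k)
{
    assert(k == 0 || k == 1);
    if (k == 0)
        code->opcode = GETSTATIC_NO_INIT;     /* first value replaced */
    else
        code->operand = PTR_TO_FIELD;         /* second value replaced */
}

/* A view is consistent when the opcode matches the operand kind it expects. */
bool consistent(slot_pair_t seen)
{
    return (seen.opcode == GETSTATIC_INIT && seen.operand == PTR_TO_FIELD_INFO)
        || (seen.opcode == GETSTATIC_NO_INIT && seen.operand == PTR_TO_FIELD);
}

/* What a second thread reads if it arrives after `steps_done` writer steps. */
slot_pair_t read_after(int steps_done)
{
    slot_pair_t code = { GETSTATIC_INIT, PTR_TO_FIELD_INFO };
    for (int k = 0; k < steps_done; k++)
        writer_step(&code, k);
    return code;
}
```

A reader arriving before or after both steps sees a matching pair; a reader arriving between the two steps sees the fast opcode paired with the old field-info pointer, which is the inconsistent state the text describes.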



4 Preparation Sequences 

In this section, we first introduce an incomplete solution to the problems
discussed in Section 3, then we introduce our preparation sequences technique.



Incomplete Solution. The two problems we face are two-values replacement
and shorter sequences caused by the slow preparation version of instructions such
as GETSTATIC. Of course, there is a simple solution to two-values replacement
that consists of using single-value replacement^ and an indirection in the fast
version of instructions, as shown in Figure 5. Note how this implementation
differs from Figure 4, in particular the additional fieldinfo indirection. This
simple solution comes at a price, though: that of an additional indirection in
a very simple instruction. Furthermore, this solution does not solve the shorter
sequences problem.



Fast Instruction with Indirection

    GETSTATIC_NO_INIT:
    { int *pvalue = (pc++)->fieldinfo->pvalue;
      *sp++ = *pvalue;
    }
    /* dispatch */
    goto **(pc++);

Code Array

    /* Initially */
    [GETSTATIC_INIT]
    [POINTER_TO_FIELD_INFO]

    /* After first execution */
    [GETSTATIC_NO_INIT]
    [POINTER_TO_FIELD_INFO]

Fig. 5. Single-Value Replacement of GETSTATIC



Single-value replacement does not require synchronization when there is a single 
aligned word to change. 
The Basic Idea. Instead, we propose a solution that solves both problems. This
solution consists of adding preparation sequences in the code array. The basic idea
of preparation sequences is to duplicate certain portions of the code array, leaving
fast inlined sequences in the main copy, and using slower, synchronized,
non-inlined preparation versions of instructions in the copy. Single-value replacement
is then used to direct control flow appropriately.
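
As a rough illustration of single-value replacement, the sketch below publishes a new opcode with one word-sized store. It uses C11 atomics as a portable stand-in for the single aligned word store that the text relies on; this is an assumption of the sketch, not the mechanism described in the paper, and `cell_t`, `replace_opcode` and `fetch_opcode` are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* One code-array cell, assumed to be a single aligned machine word. */
typedef _Atomic(uintptr_t) cell_t;

/* Replace the opcode stored in a cell with a single atomic word store.
 * Readers see either the old value or the new one, never a mix. */
void replace_opcode(cell_t *cell, uintptr_t fast_opcode)
{
    assert(cell != NULL);
    atomic_store_explicit(cell, fast_opcode, memory_order_release);
}

/* Fetch the opcode currently stored in a cell. */
uintptr_t fetch_opcode(cell_t *cell)
{
    assert(cell != NULL);
    return atomic_load_explicit(cell, memory_order_acquire);
}
```

Because only one word changes, any thread dispatching through the cell sees either the slow opcode or the fast one, so the dispatch path needs no lock. Whatever the new opcode depends on (such as a resolved operand) must already be in place before the store.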



Single-Instruction Preparation Sequence. Preparation sequences are best
explained using a simple illustrative example. We continue with our GETSTATIC
example. We assume, for the moment, that the GETSTATIC instruction is
preceded and followed by non-inlinable instructions in the code array. An
appropriate instruction sequence would be MONITORENTER-GETSTATIC-MONITOREXIT,
as neither monitor instruction is inlinable.

Figure 6, (a) and (b), illustrates the initial content of a prepared code array
containing the above 3-instruction sequence. The GETSTATIC preparation
sequence appears at the end of the code array. The initial content of the code
array is as follows. After the MONITORENTER, we insert a GOTO instruction followed
by two operands: (i) the address of the GETSTATIC preparation sequence, and
(ii) an additional word (initially NULL) which will eventually hold a pointer to
the static field. At the end of the code array, we add a preparation sequence,
which consists of 3 instructions (identified by a *) along with their operands.



(a) Original Bytecode

    MONITORENTER
    GETSTATIC
    INDEXBYTE1
    INDEXBYTE2
    MONITOREXIT

(b) Initial Content of Code Array

                [MONITORENTER]*
    OPCODE_1:   [GOTO]*
                [@ SEQUENCE_1]
    OPERAND_1:  [NULL POINTER]
    NEXT_1:     [MONITOREXIT]*

    SEQUENCE_1: [GETSTATIC_INIT]*
                [POINTER_TO_FIELDINFO]
                [@ OPERAND_1]
                [REPLACE]*
                [GETSTATIC_NO_INIT]
                [@ OPCODE_1]
                [GOTO]*
                [@ NEXT_1]

    Opcodes followed by a * are instructions.

(c) GETSTATIC_INIT

    GETSTATIC_INIT:
    { fieldinfo_t *fieldinfo = (pc++)->fieldinfo;
      int **destination = (pc++)->ppint;
      pthread_mutex_lock(...);

      /* lazily load and initialize
         class, and resolve field */

      /* store field information in
         code array */
      *destination = fieldinfo->pvalue;

      /* do the real work */
      *sp++ = *(fieldinfo->pvalue);
      pthread_mutex_unlock(...);
    }
    /* dispatch */
    goto **(pc++);

(d) GETSTATIC_NO_INIT

    GETSTATIC_NO_INIT:
    /* skip address */
    pc++;
    { int *pvalue = (pc++)->pvalue;
      /* do the real work */
      *sp++ = *pvalue;
    }
    /* dispatch */
    goto **(pc++);

(e) GOTO

    GOTO:
    { void *address = (pc++)->address;
      pc = address;
    }
    /* dispatch */
    goto **(pc++);

(f) REPLACE

    REPLACE:
    { void *instruction = (pc++)->instruction;
      void **destination = (pc++)->ppvoid;
      *destination = instruction;
    }
    /* dispatch */
    goto **(pc++);

Fig. 6. Single GETSTATIC Preparation Sequence
Figure 6, (c) to (f), shows the implementation of four instructions: GOTO,
REPLACE, GETSTATIC_INIT, and GETSTATIC_NO_INIT. Notice that in the preparation
sequence, the GETSTATIC_NO_INIT opcode is used as an operand to the
REPLACE instruction.

We used labels (e.g. SEQUENCE_1:) to represent the address of specific opcodes.
In the real code array, absolute addresses are stored in opcodes such as
[@ SEQUENCE_1].

Here is how execution proceeds. On the first execution of this portion of the 
code, the MONITORENTER instruction is executed. Then, the GOTO instruction is 
executed, reading its destination in the following word. The destination is the 
SEQUENCE_1 label, or more accurately, the GETSTATIC_INIT opcode, at the head 
of the preparation sequence. 

The GETSTATIC_INIT instruction then reads two operands: (a) a pointer to
the field information structure, and (b) a destination pointer for storing a pointer
to the resolved static field. It then proceeds normally, loading and initializing
the class, and resolving the field, if it hasn't yet been done^. Then, it stores
the address of the resolved field in the destination location. Notice that, in the
present case, this means that the pointer-to-field will overwrite the NULL value
at label OPERAND_1. Finally, it executes the real work portion of the instruction,
and dispatches to the next instruction.

The next instruction is a special one, called REPLACE, which simply stores
the value of its first operand into the address pointed to by its second operand.
In this particular case, a pointer to the GETSTATIC_NO_INIT instruction will be
stored at label OPCODE_1, overwriting the former GOTO instruction pointer. This
constitutes, in fact, our single-value replacement.

The next instruction is simply a GOTO used to exit the preparation sequence. 
It jumps to the instruction following the original GETSTATIC bytecode, which in 
our specific case is the MONITOREXIT instruction. 

Future executions of the same portion of the code array will see a
GETSTATIC_NO_INIT instruction (at label OPCODE_1), instead of a GOTO to the preparation
sequence. Two-values replacement is avoided by leaving the GOTO operand address
in place. Notice how the implementation of GETSTATIC_NO_INIT in Figure 6(d)
differs from the implementation in Figure 4 by an additional pc++ to skip the
address operand.
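
The walkthrough above can be traced with a small simulation. This is only a sketch under several assumptions that are not in the paper: it uses a portable switch loop instead of labels-as-values, array indices instead of absolute addresses, NOPs in place of MONITORENTER and MONITOREXIT, a plain `int` as the "resolved field", and it omits the mutex because it runs on a single thread. All names (`word_t`, `init_code`, `run`, the `OP_` constants) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { OP_NOP, OP_GOTO, OP_REPLACE, OP_GETSTATIC_INIT,
       OP_GETSTATIC_NO_INIT, OP_END };

typedef intptr_t word_t;               /* one code-array cell */

/* Code-array indices standing in for the labels of Figure 6(b). */
enum { OPCODE_1 = 1, OPERAND_1 = 3, NEXT_1 = 4, SEQUENCE_1 = 6, CODE_LEN = 14 };

static int static_field = 42;          /* stands in for the static field */

void init_code(word_t *code)
{
    code[0]  = OP_NOP;                 /* MONITORENTER stand-in */
    code[1]  = OP_GOTO;                /* OPCODE_1 */
    code[2]  = SEQUENCE_1;             /* @ SEQUENCE_1 */
    code[3]  = 0;                      /* OPERAND_1: NULL pointer */
    code[4]  = OP_NOP;                 /* NEXT_1: MONITOREXIT stand-in */
    code[5]  = OP_END;
    code[6]  = OP_GETSTATIC_INIT;      /* SEQUENCE_1 */
    code[7]  = (word_t)&static_field;  /* field info (simplified) */
    code[8]  = OPERAND_1;              /* @ OPERAND_1 */
    code[9]  = OP_REPLACE;
    code[10] = OP_GETSTATIC_NO_INIT;
    code[11] = OPCODE_1;               /* @ OPCODE_1 */
    code[12] = OP_GOTO;
    code[13] = NEXT_1;                 /* @ NEXT_1 */
}

/* Execute the code array once; return the value left on top of the stack. */
int run(word_t *code)
{
    int stack[8], *sp = stack;
    size_t pc = 0;
    for (;;) {
        switch (code[pc++]) {
        case OP_NOP:
            break;
        case OP_GOTO:
            pc = (size_t)code[pc];
            break;
        case OP_GETSTATIC_INIT: {
            int *field = (int *)code[pc++];    /* "resolve" the field */
            size_t dest = (size_t)code[pc++];
            code[dest] = (word_t)field;        /* fill OPERAND_1 */
            *sp++ = *field;                    /* real work */
            break;
        }
        case OP_REPLACE: {
            word_t insn = code[pc++];
            size_t dest = (size_t)code[pc++];
            code[dest] = insn;                 /* single-value replacement */
            break;
        }
        case OP_GETSTATIC_NO_INIT:
            pc++;                              /* skip stale sequence address */
            *sp++ = *(int *)code[pc++];
            break;
        case OP_END:
            assert(sp > stack);
            return sp[-1];
        default:
            assert(!"unknown opcode");
            return -1;
        }
    }
}
```

Running the array twice shows both paths: the first run goes through the preparation sequence and overwrites only the word at `OPCODE_1`, and the second run executes the fast instruction directly while the old sequence address at index 2 stays in place.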



Some Explanations. Our single-instruction preparation sequence has avoided
two-values replacement by using an extra word to permanently store a preparation
sequence address operand, even though this address is useless after initial
execution.

This approach adds some overhead in the fast version of the overloaded
instruction: that of a program-counter increment, to skip the preparation sequence
address. One could easily question whether this gains any performance improvement
over that of using an indirection as in Figure 5. This will be answered by
looking at longer preparation sequences.

The strangest-looking thing is the usage of 3 distinct instructions in the
preparation sequence. Why not use a single instruction with more operands?
Again, the answer lies in the implementation of longer preparation sequences.

^ Each field is only resolved once, yet there can be many GETSTATIC instructions
accessing this field. The same holds for class loading and initialization.



Full Preparation Sequences. We now proceed with the full implementation
of preparation sequences. Our objective is two-fold: (a) we want to avoid
two-values replacement, and (b) we want to build longer inlined instruction sequences
for our inline-threaded interpreter, for reducing dispatch overhead as much as
possible.

To demonstrate our technique, we use the three-instruction sequence
ICONST2-GETSTATIC-ILOAD.

Figure 7 shows the initial state of the code array (b), the content
of the dynamically constructed ICONST2-GETSTATIC-ILOAD inlined instruction
sequence (e), some related instruction implementations (c, d), and the content of
the code array after first execution (f).

This works similarly to the single-instruction preparation sequence, with two
major differences: (a) the jump to the preparation sequence initially replaces the
ICONST_2 instruction, instead of the GETSTATIC instruction, and (b) the REPLACE
instruction stores a pointer to an inlined instruction sequence, overwriting the
GOTO instruction.

Here is how execution proceeds in detail. On the first execution of this portion
of the code, the GOTO instruction is executed. Its destination is the ICONST_2
opcode, at the head of the preparation sequence.

Next, the ICONST_2 instruction is executed. Next, the GETSTATIC_INIT
instruction reads two operands: (a) a pointer to the field information structure,
and (b) a destination pointer for storing a pointer to the resolved static field. It
then proceeds normally, loading and initializing the class, and resolving the field,
if it hasn't yet been done. Then, it stores the address of the resolved field in the
destination location. Finally, it executes the real work portion of the instruction,
and dispatches to the next instruction.

The next instruction is a REPLACE, which simply stores a pointer to the
dynamically inlined instruction sequence ICONST2-GETSTATIC-ILOAD at label
OPCODE_1, overwriting the former GOTO instruction, and performing a single-value
replacement.

Next, the ILOAD instruction is executed. Finally, the tail GOTO exits the
preparation sequence.

Future executions of the same portion of the code array will see the
ICONST2-GETSTATIC-ILOAD instruction sequence (at label OPCODE_1), as shown in
Figure 7(f). Notice that the inlined implementation of GETSTATIC_NO_INIT in
Figure 7(c) does not add any overhead to the fast implementation shown in Figure 4.

Thus, we have achieved our goals. In particular, we have succeeded at inlining
an instruction sequence, even though it had a complex two-modes (preparation
(a) Bytecode

    ICONST_2
    GETSTATIC
    INDEXBYTE1
    INDEXBYTE2
    ILOAD
    INDEX

(b) Initial Content of Code Array

    OPCODE_1:   [GOTO]*
                [@ SEQUENCE_1]
    OPERAND_1:  [NULL POINTER]
                [INDEX]
    NEXT_1:     ...

    SEQUENCE_1: [ICONST_2]*
                [GETSTATIC_INIT]*
                [POINTER_TO_FIELDINFO]
                [@ OPERAND_1]
                [REPLACE]*
                [ICONST2-GETSTATIC-ILOAD]
                [@ OPCODE_1]
                [ILOAD]*
                [INDEX]
                [GOTO]*
                [@ NEXT_1]

    Opcodes followed by a * are instructions.

(c) GETSTATIC_NO_INIT

    GETSTATIC_NO_INIT_START:
    { int *pvalue = (pc++)->pvalue;
      *sp++ = *pvalue;
    }
    GETSTATIC_NO_INIT_END:
    /* dispatch */
    goto **(pc++);

(d) SKIP

    SKIP_START:
      pc++;
    SKIP_END:
    /* dispatch */
    goto **(pc++);

(e) ICONST2-GETSTATIC-ILOAD Inlined Instruction Sequence

    SKIP body:              pc++;
    ICONST_2 body:          *sp++ = 2;
    GETSTATIC_NO_INIT body: { int *pvalue = (pc++)->pvalue;
                              *sp++ = *pvalue; }
    ILOAD body:             { int index = (pc++)->index;
                              *sp++ = locals[index]; }
    DISPATCH body:          goto **(pc++);

(f) Code Array After First Execution

    OPCODE_1:   [ICONST2-GETSTATIC-ILOAD]*
                [@ SEQUENCE_1]
    OPERAND_1:  [POINTER_TO_FIELD]
                [INDEX]
    NEXT_1:     ...

    SEQUENCE_1: [ICONST_2]*
                [GETSTATIC_INIT]*
                [POINTER_TO_FIELDINFO]
                [@ OPERAND_1]
                [REPLACE]*
                [ICONST2-GETSTATIC-ILOAD]
                [@ OPCODE_1]
                [ILOAD]*
                [INDEX]
                [GOTO]*
                [@ NEXT_1]

    Opcodes followed by a * are instructions.

Fig. 7. Full Preparation Sequence



/ fast) instruction in the middle, while avoiding two-values replacement. All of
this is achieved with minimal overhead in executions of the code array after the first.



Detailed Preparation Procedure. Preparation of a code array, in anticipation
of inline-threading, proceeds as follows:

1. Instructions are divided into three groups: inlinable, two-modes-inlinable (such
as GETSTATIC), and non-inlinable.

2. Basic blocks (determined by control flow and non-inlinable instructions) are
identified.

3. Basic blocks of inlinable instructions, without two-modes-inlinable instruc- 
tions, are inlined normally. 

4. Every basic block containing two-modes-inlinable instructions causes the 
generation of an additional preparation sequence at the end of the code array, 
and the construction of a related inlined instruction sequence. 
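
Steps 1 to 4 can be sketched as a classification pass. The code below is a simplified illustration, not SableVM's preparation code: `kind_t` and `count_preparation_sequences` are hypothetical names, and the sketch assumes straight-line code, so block boundaries come only from non-inlinable instructions (control-flow boundaries are ignored).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* The three instruction groups of step 1. */
typedef enum { INLINABLE, TWO_MODES_INLINABLE, NON_INLINABLE } kind_t;

/* Count the blocks that need a preparation sequence (step 4): those
 * containing at least one two-modes-inlinable instruction. Blocks are
 * maximal runs of instructions that are not NON_INLINABLE. */
size_t count_preparation_sequences(const kind_t *kinds, size_t n)
{
    assert(kinds != NULL || n == 0);
    size_t count = 0;
    bool block_has_two_modes = false;
    for (size_t i = 0; i < n; i++) {
        if (kinds[i] == NON_INLINABLE) {
            block_has_two_modes = false;      /* block boundary */
        } else if (kinds[i] == TWO_MODES_INLINABLE && !block_has_two_modes) {
            block_has_two_modes = true;       /* first one in this block */
            count++;
        }
    }
    return count;
}
```

A block made only of inlinable instructions yields no preparation sequence, and a block with several two-modes-inlinable instructions still yields exactly one.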
The construction of a preparation sequence proceeds as follows:

1. Instructions are copied sequentially into the preparation sequence. 

- Inlinable instructions and their operands are simply copied as-is.

- The preparation version of two-modes-inlinable instructions is copied into
the preparation sequence, along with the destination address for resolved
operands.

2. A REPLACE instruction with appropriate operands is inserted just after the 
last two-modes-inlinable instruction. 

3. A final GOTO instruction with appropriate operand is added at the end of the 
preparation sequence. 

The motivation for adding the REPLACE instruction just after the last
two-modes-inlinable instruction is that it is the earliest safe place to do so. Replacing
sooner could cause the execution (on another Java thread) of the fast version
of an upcoming two-modes instruction before it is actually prepared. Replacing
later can also be a problem, especially if some upcoming inlinable instruction
is a conditional (or unconditional) branch instruction. This is because, if the
branch is taken, then single-value replacement will not take place, forcing the
next execution to take the slow path^.

The construction of an inlined instruction sequence containing two-modes- 
inlinable instructions proceeds as follows: 

1. The body of the SKIP instruction is copied at the beginning of the sequence 
implementation. 

2. Then, all instruction bodies are sequentially copied. 

3. Finally, the body of the DISPATCH instruction is copied at the end of the 
sequence implementation. 

Note that a single preparation sequence can contain multiple two-modes in- 
structions. Yet, on the fast execution path, there is a single program-counter 
increment (i.e. SKIP body) per inlined instruction sequence. 
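
The three construction steps amount to a concatenation of code bodies. The sketch below treats each body as an opaque byte range so that it stays portable. In an actual engine the ranges would be delimited by label addresses, as in Figure 2, and the buffer would need to be executable. The `body_t` type and `build_sequence` function are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* A body is a byte range; a real engine would delimit machine code. */
typedef struct {
    const unsigned char *start;
    size_t size;
} body_t;

/* Build SKIP + instruction bodies + DISPATCH in a freshly allocated
 * buffer. Returns NULL on allocation failure; the caller frees it. */
unsigned char *build_sequence(body_t skip, const body_t *bodies, size_t n,
                              body_t dispatch, size_t *out_size)
{
    size_t total = skip.size + dispatch.size;
    for (size_t i = 0; i < n; i++)
        total += bodies[i].size;

    unsigned char *buf = malloc(total ? total : 1);
    if (buf == NULL)
        return NULL;

    unsigned char *cur = buf;
    memcpy(cur, skip.start, skip.size);          /* step 1: SKIP body */
    cur += skip.size;
    for (size_t i = 0; i < n; i++) {             /* step 2: all bodies */
        memcpy(cur, bodies[i].start, bodies[i].size);
        cur += bodies[i].size;
    }
    memcpy(cur, dispatch.start, dispatch.size);  /* step 3: DISPATCH body */
    cur += dispatch.size;

    assert((size_t)(cur - buf) == total);
    *out_size = total;
    return buf;
}
```

However many two-modes instructions the block contains, the SKIP body is copied once, at the front, which is why the fast path pays a single program-counter increment per inlined sequence.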

5 Experimental Results 

We have implemented 3 flavors of threaded interpretation, in the Sable VM frame- 
work [^: switch-threading, direct-threading and inline-threading. Switch-threa- 
ding differs from simple switch-based bytecode interpretation in that it is ap- 
plied on a prepared code array of word-size elements. To avoid the two-values 
replacement problem, single-instruction preparation sequences are in use within 
the switch-threaded and direct-threaded engines. We have performed execution 
time measurements with SableVM to measure the efficiency of inline-threading 

Java, using our technique. We have performed our experiments on a 1.5 GHz
Pentium IV based Debian GNU/Linux workstation with 1.5 GB of RAM and a
7200 RPM disk, running SPECjvm98 benchmarks and two object-oriented
applications: Soot version 1.2.30 and SableCC version 2.17.

^ Multiple executions of the same preparation sequence are allowed, but suffer from high
dispatch overhead. This can happen in the normal operation of the inline-threaded
interpreter as the result of an exception thrown before single-value replacement,
while executing a preparation sequence.

In a first set of experiments, we have measured the relative performance of
the switch-threaded, direct-threaded and inline-threaded engines. Results are
shown in Table 1. To do these experiments, three separate versions of SableVM
were compiled with identical configuration options, except for the interpreter
engine type.



Table 1. Inline-Threading Performance Measurements

benchmark   switch-threaded   direct-threaded       inline-threaded
compress    317.72 sec.       281.78 sec. (1.13)    131.64 sec. (2.41) (2.14)
db          132.15 sec.       119.17 sec. (1.11)     87.64 sec. (1.51) (1.36)
jack         45.65 sec.        46.78 sec. (0.98)     38.16 sec. (1.20) (1.23)
javac       110.10 sec.       105.24 sec. (1.05)     89.37 sec. (1.23) (1.17)
jess         74.79 sec.        68.12 sec. (1.10)     53.57 sec. (1.40) (1.27)
mpegaudio   285.77 sec.       242.90 sec. (1.18)    136.97 sec. (2.09) (1.77)
mtrt        142.87 sec.       115.34 sec. (1.24)    100.39 sec. (1.42) (1.15)
raytrace    166.19 sec.       134.06 sec. (1.24)    113.55 sec. (1.46) (1.18)
soot        676.06 sec.       641.96 sec. (1.05)    548.13 sec. (1.23) (1.17)
sablecc      40.12 sec.        36.95 sec. (1.09)     26.09 sec. (1.54) (1.41)



Columns of Table 1 contain respectively: (a) the name of the executed
benchmark, (b) the execution time in seconds using the switch-threaded engine, (c) the
execution time in seconds using the direct-threaded engine, and the speedup over 
the switch-threaded engine in parentheses, and (d) the execution time in seconds 
using the inline-threaded engine, and the speedup over both switch-threaded and 
direct-threaded engines respectively in parentheses. 

The inline-threaded engine does deliver significant performance improvement.
It achieves a speedup of up to 2.41 over the switch-threaded engine. The smallest
measured speedup over the faster of the two other engines on a given benchmark is
1.15, on the mtrt benchmark, where it still delivers a speedup of 1.42 over the
other (switch-threaded) engine.
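
The parenthesized figures in Table 1 are consistent with the usual definition of speedup, the baseline time divided by the new time. The helper below is a hypothetical one-liner that makes the arithmetic explicit, using the compress and mtrt rows as inputs.

```c
#include <assert.h>

/* Speedup of a new engine over a baseline: baseline time / new time. */
double speedup(double baseline_sec, double new_sec)
{
    assert(new_sec > 0.0);
    return baseline_sec / new_sec;
}
```

For compress, `speedup(317.72, 131.64)` and `speedup(281.78, 131.64)` round to the 2.41 and 2.14 reported in the table; for mtrt, `speedup(115.34, 100.39)` rounds to 1.15.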

It is important to note that the switch-threaded engine already has some 
advantages over a pure switch-based bytecode interpreter. It benefits from word 
alignment and other performance-improving features of the SableVM framework.
So, it is likely that the performance gains of inline-threading over pure bytecode 
interpretation are even bigger than those measured against switch-threading. 

In a second set of tests, we measured the performance improvement due
to the inlining of two-modes instructions (e.g. GETSTATIC) within the
inline-threaded engine. To do so, we compiled a version of SableVM with a special
option that prevents inlining of two-modes instructions, and compared its speed
to the normal inline-threaded engine. Results are shown in Table 2.

^ http://www.sable.mcgill.ca/soot/
^ http://www.sablecc.org/



Table 2. Preparation Sequences Performance Measurements

benchmark   shorter sequences   full sequences   speedup
compress    195.50 sec.         131.64 sec.      1.49
db          108.22 sec.          87.64 sec.      1.23
jack         40.46 sec.          38.16 sec.      1.06
javac        99.99 sec.          89.37 sec.      1.12
jess         62.91 sec.          53.57 sec.      1.17
mpegaudio   157.38 sec.         136.97 sec.      1.16
mtrt        105.39 sec.         100.39 sec.      1.05
raytrace    133.12 sec.         113.55 sec.      1.17
soot        617.42 sec.         548.13 sec.      1.13
sablecc      32.35 sec.          26.09 sec.      1.24



Columns of Table 2 contain respectively: (a) the name of the executed
benchmark, (b) the execution time in seconds using the special inline-threaded engine
that does not inline two-modes instructions, (c) the execution time in seconds us- 
ing the normal inline-threaded engine implementing full preparation sequences, 
and (d) the speedup achieved by the normal inline-threaded engine over the 
atrophied version. 

Our performance measurements show that the speedup due to longer se- 
quences ranges between 1.05 and 1.49, which is quite significant. 



6 Related Work 

The most closely related work to that of this paper is the work of I. Piumarta
and F. Riccardi. We have already discussed the inline-threading
technique introduced in their paper in Section 2. Our work builds on top of this
work, by introducing techniques to deal with multi-threaded execution
environments, and inlining of two-modes instructions. Inline-threading, in turn, is the
result of combining the Forth-like threaded interpretation technique [5] (which
we have already discussed in Section 2) with the idea of template-based dynamic
compilation PHo]. The main advantage of inline-threading over that of template 
based compilation is its simplicity and portability. 

A related system for dynamic code generation is that of vcode, introduced
by D. Engler [3]. The vcode system is an architecture-neutral runtime assembler.
It can be used for implementing just-in-time compilers. It is in our future plans
to experiment with vcode for constructing an architecture-neutral just-in-time
compiler for SableVM, offering an additional choice of performance-portability
tradeoff.
Other closely related work is that of dynamic patching. The problem of
potentially high synchronization costs for concurrent modification of executed
code is also faced by dynamically adaptive Java systems. M. Cierniak et
al. describe a technique for dynamic inline patching (a similar technique is also 
described in 0). The main idea is to store a self-jump (a jump instruction to it- 
self) in the executable code stream before proceeding with further modifications 
of the executable code. This causes any concurrent thread executing the same 
instruction to spin-wait for the completion of the modification operation. 

Our technique of using explicit synchronization in preparation sequences and
single-value replacement has the marked advantage of causing no spin-wait.
Spinning can have, in some cases, a highly undesirable side effect: that of almost
dead-locking the system when the spinning thread has much higher priority
than the code patching thread. This is because, while it is spinning, the
high-priority thread does not make any progress in code execution and, depending on the
thread scheduling policy of the host operating system, might be preventing the
patching thread from making noticeable progress.

7 Conclusions 

In this paper we have explained the difficulty of using the inline-threaded
interpretation technique in a Java interpreter. Then, we introduced a new technique,
preparation sequences, that not only makes it possible, but also effective. This
technique uses efficient single-word replacement for managing lazy class-loading
and preparation in a multi-threaded environment, and increases the length of
inlined instruction sequences, reducing dispatch overhead. We then presented our
experimental results, showing that an inline-threaded interpreter engine,
implementing our technique, achieves significant performance improvements over that
of switch-threaded and direct-threaded engines. Our results also show that longer
inlined instruction sequences, due solely to preparation sequences, can yield a
speedup ranging between 1.05 and 1.49.
Abstract. We study an incorporation of generations into a modern reference
counting collector. We start with the two on-the-fly collectors
suggested by Levanoni and Petrank: a reference counting collector and
a tracing (mark and sweep) collector. We then propose three designs
for combining them so that the reference counting collector collects the
young generation or the old generation or both. Our designs maintain the
good properties of the Levanoni-Petrank collector. In particular, they are
adequate for a multithreaded environment and a multiprocessor platform, and
they have an efficient write barrier with no synchronization operations. To
the best of our knowledge, the use of generations with reference counting
has not been tried before.

We have implemented these algorithms with the Jikes JVM and com- 
pared them against the concurrent reference counting collector supplied 
with the Jikes package. As expected, the best combination is the one 
that lets the tracing collector work on the young generation (where 
most objects die) and the reference counting work on the old generation 
(where many objects survive). Matching the expected survival rate with 
the nature of the collector yields a large improvement in throughput 
while maintaining the pause times around a couple of milliseconds. 

Keywords: Runtime systems, Memory management, Garbage collection,
Generational garbage collection.



1 Introduction 

Automatic memory management is widely acknowledged as an important tool for
fast development of large reliable software. It turns out that the garbage
collection process has an important impact on the overall runtime performance.
Thus, clever design of efficient memory management and garbage collection is
an important goal in today’s technology.



1.1 Reference Counting 

Reference counting is the most intuitive method for automatic storage
management and has been known since the sixties (cf. [S]). The main idea is
that we keep for each object a count of the number of references to the
object. When this number becomes zero for an object o, we know that o can be
reclaimed. Reference counting seems very promising for future garbage
collected systems, especially with the spread of 64-bit architectures and the
increasing use of very large heaps. Tracing collectors must traverse all live
objects, and thus, the bigger the usage of the heap (i.e., the amount of live
objects in the heap), the more work the collector must perform. Reference
counting is different. The amount of work is proportional to the amount of
work done by the user program between collections plus the amount of space
that is actually reclaimed. It does not depend on the space consumed by live
objects in the heap.

Historically, the study of concurrent reference counting for modern
multithreaded environments and multiprocessor platforms has not been as
extensive and thorough as the study of concurrent and parallel tracing
collectors. However, recently, we have seen several studies and
implementations of modern reference counting algorithms on modern platforms,
building and improving on previous work. Levanoni and Petrank, following
DeTreville [2], have presented an on-the-fly reference counting algorithm that
overcomes the concurrency problems of reference counting. Levanoni and Petrank
have completely eliminated the need for synchronization operations in the
write barrier. In addition, the algorithm of Levanoni and Petrank drastically
reduces the number of counter updates (for common benchmarks).

1.2 Generational Collection 

Generational garbage collection was introduced by Lieberman and Hewitt [18], 
and the first published implementation was by Ungar. Generational garbage 
collectors rely on the assumption that many objects die young. The heap is parti- 
tioned into two parts: the young generation and the old generation. New objects 
are allocated in the young generation, which is collected frequently. Young ob- 
jects that survive several collections are “promoted” to the older generation. If 
the generational assumption (i.e., that most objects die young) is indeed correct, 
we get several advantages. Pauses for the collection of the young generation are 
short; collections are more efficient since they concentrate on the young part of 
the heap where we expect to find a high percentage of garbage; and finally, the 
working set size is smaller both for the program (because it repeatedly reuses 
the young area) and for the collector (because most of the collections trace over 
a smaller portion of the heap). 

Since in this paper we discuss an on-the-fly collector, we do not expect to
see a reduction in pause times: they are already extremely low. Our goal is to
keep the low pauses of the original algorithm. However, increased efficiency
and better locality may give us a better overall collection time and a better
throughput. This is indeed what we achieve.



1.3 This Work 

In this work, we study how generational collection interacts with reference
counting. Furthermore, we employ a modern reference counting algorithm
adequate for running on a modern environment (i.e., multithreaded) and a
modern platform (i.e., multiprocessor). We study three alternative uses of
reference counting with generations. In the first, both the young and the old
generations are collected using reference counting. In the second, the young
generation is collected via reference counting and the collector of the old
generation is a mark-and-sweep collector. The last alternative we explore is
the use of reference counting to collect the old generation and mark-and-sweep
to collect the young generation. As building blocks, we use the
Levanoni-Petrank sliding view collectors: the reference counting collector and
the mark-and-sweep collector. Our new generational collectors are on-the-fly
and employ a write barrier that uses no synchronization operation (like the
original collectors).

Note that one combination is expected to win the race. Normally, the
percentage of objects that survive is small in the young generation and high
in the old generation. If we look at the complexity of the involved
algorithms, reference counting has complexity related to the number of dead
objects. Thus, it matches the low death rate of the old generation. Tracing
collectors traverse only live objects and so do better when most objects die;
thus, they match the high death rate of the young generation. Indeed, the
combination employing tracing for the young generation and reference counting
for the old yields the best results.

In addition to the new study of generations with reference counting, our work
is also interesting as yet another attempt to run generations with an
on-the-fly collector. The only other work that we are aware of that uses
generations with an on-the-fly collector is the work of Domani, Kolodner, and
Petrank, in which generations are used with a mark and sweep collector [15].

Footnote: A partial incorporation of generations with a mark and sweep
collector, used only for immutable objects, was used by Doligez, Leroy and
Gonthier. The whole scheme depends on the fact that many objects in ML are
immutable. This is not true for Java and other imperative languages.
Furthermore, the collection of the young generation is not concurrent. Each
thread has its own private young generation (used only for immutable objects),
which is collected while that thread is stopped.



1.4 Generational Collection without Moving Objects

Usually, on-the-fly garbage collectors do not move objects; the cost of moving
objects while running concurrently with program threads is too high. Demers
et al. presented a generational collector that does not move objects. Their
motivation was to adapt generations for conservative garbage collection. Here
we exploit their ideas: instead of partitioning the heap physically and
keeping the young objects in a separate area, we partition the heap logically.
For each object, we keep one bit indicating if it is young or old.

1.5 Implementation and Results 

We have implemented our algorithms on Jikes, a Research Java Virtual Machine,
version 2.0.3 (upon Linux Red-Hat 7.2). The entire system, including the
collector itself, is written in Java (extended with unsafe primitives to
access raw memory). We have taken measurements on a 4-way IBM Netfinity 8500R
server with 550MHz Intel Pentium III Xeon processors and 2GB of physical
memory. The benchmarks used were the SPECjvm98 benchmark suite and the
SPECjbb2000 benchmark. These benchmarks are described in detail on SPEC’s Web
site. In Section 5 we report the measurements we ran with our collectors. We
tested our new collectors against the Jikes concurrent collector distributed
with the Jikes Research Java Virtual Machine package. This collector is a
reference counting concurrent collector developed at IBM and reported in [3].
Our most efficient collector (the one that uses reference counting for the old
generation) achieves excellent performance measures. The throughput is
improved by up to 40% for the SPECjbb2000 benchmark. The pauses are also
smaller. These results hold for the default heap size of the benchmarks.
Running the collectors on tight heaps shows that our generational collector is
not suitable for very small heaps. In such conditions, the original Jikes
algorithm performs better. A possible explanation for this phenomenon is that
reference counting is more efficient than the tracing collection (of the young
generation) when the collections are too frequent. In this case, the tracing
collector must trace the live (young) objects repeatedly, whereas reference
counting only spends time proportional to the work done in between the
collections.

1.6 Cycle Collection 

A major disadvantage of reference counting is that it does not collect cycles.
If the old generation is collected with a mark-and-sweep collector, there is
no issue, since the cycles will be collected then. When reference counting is
used for the old generation, we also use the mark-and-sweep collector
occasionally to collect the full heap and reclaim garbage cycles.

1.7 Organization 

In Section 2 we review reference counting developments through recent years
and mention related work. In Section 3 we present the Levanoni-Petrank
collectors we build on. In Section 4 we present the generational algorithms.
In Section 5 we discuss our implementation and present our measurements. We
conclude in the final section.

Footnote to Section 1.6: Another option was to use the cyclic structures
collector of Bacon and Rajan [3], but from their measurements it seems that
tracing collectors should be more efficient. Thus, we chose to use the readily
available tracing collector.

2 An Overview on Reference Counting Algorithms 



The traditional method of reference counting was first developed for Lisp by
Collins. The idea is to keep a reference count field for each object telling
how many references exist to the object. Whenever a pointer is updated, the
system invokes a write barrier that keeps the reference counts updated. In
particular, if the pointer is modified from pointing to O1 into pointing to
O2, then the write barrier decrements the count of O1 and increments the count
of O2. When the counter of an object is decreased to zero, it is reclaimed.
The reference counts of all the objects it references (its children values at
the previous sliding view) are then decremented as well, and the reclamation
may continue recursively. Improvements to the naive algorithm were suggested
in several subsequent papers. Weizenbaum studied ameliorating the delay
introduced by recursive deletion. Several works use a single bit for each
reference counter with a mechanism to handle overflows, the idea being that
most objects are singly-referenced, except for the duration of short
transitions.
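
The naive scheme described above can be illustrated with a minimal,
single-threaded sketch. The class `Obj`, the functions `write_barrier` and
`release`, and the `freed` list are hypothetical names introduced only for
this illustration; they are not taken from any collector discussed here.

```python
class Obj:
    """A heap object with a reference count and named pointer slots."""

    def __init__(self, name):
        self.name = name
        self.rc = 0
        self.slots = {}


freed = []  # names of reclaimed objects, in reclamation order


def release(obj):
    """Drop one reference to obj; reclaim it and release its children at zero."""
    obj.rc -= 1
    if obj.rc == 0:
        freed.append(obj.name)
        for child in obj.slots.values():
            if child is not None:
                release(child)  # reclamation continues recursively
        obj.slots.clear()


def write_barrier(holder, slot, new):
    """Store `new` into holder.slots[slot] and keep both counts up to date."""
    old = holder.slots.get(slot)
    if new is not None:
        new.rc += 1  # increment first so that re-storing the same object is safe
    holder.slots[slot] = new
    if old is not None:
        release(old)


root = Obj("root")
root.rc = 1  # assumed to be pinned by an external reference
o1, o2, leaf = Obj("o1"), Obj("o2"), Obj("leaf")
write_barrier(root, "p", o1)
write_barrier(o1, "child", leaf)
write_barrier(root, "p", o2)  # o1 loses its only reference
print(freed)
```

Redirecting the pointer from o1 to o2 reclaims o1 and then, recursively, the
leaf it referenced, which is the source of the deletion delay mentioned above.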

Deutsch and Bobrow [10] noted that most of the overhead on counter updates
originates from the frequent updates of local references (in stack and
registers). They suggested using the write barrier only for pointers on the
heap. Now, when a reference count decreases to zero, the object cannot be
reclaimed immediately since it may still be reachable from local references.
To collect objects, a collection is invoked. During the collection one can
reclaim all objects with zero heap reference count that are not accessible
from local references. Their method is called deferred reference counting and
it yields a great saving in the write barrier overhead. It is used in most
modern reference counting collectors. In particular, this method was later
adapted for Modula-2+. Further study on reducing work for local variables has
also been reported in the literature.
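
As a rough illustration of deferred reference counting, the following
single-threaded sketch counts heap references only and postpones reclamation
to an explicit collection that is given the local references. All names
(`Obj`, `allocate`, `heap_write`, `collect`, `zct`) are assumptions made for
this example.

```python
class Obj:
    def __init__(self, name):
        self.name = name
        self.heap_rc = 0  # counts references from heap slots only
        self.slots = {}


zct = set()  # zero count table: objects whose heap count is zero


def allocate(name):
    obj = Obj(name)
    zct.add(obj)  # a new object is referenced only by locals
    return obj


def heap_write(holder, slot, new):
    """Write barrier, applied to heap slots only (never to locals)."""
    old = holder.slots.get(slot)
    if new is not None:
        new.heap_rc += 1
        zct.discard(new)
    holder.slots[slot] = new
    if old is not None:
        old.heap_rc -= 1
        if old.heap_rc == 0:
            zct.add(old)  # cannot free yet: a local may still reference it


def collect(local_roots):
    """Reclaim zero-count objects that no local reference points to."""
    roots = set(local_roots)
    reclaimed = []
    pending = [o for o in zct if o not in roots]
    while pending:
        obj = pending.pop()
        if obj in roots or obj not in zct:
            continue
        zct.discard(obj)
        reclaimed.append(obj.name)
        for child in obj.slots.values():
            if child is not None:
                child.heap_rc -= 1
                if child.heap_rc == 0:
                    zct.add(child)
                    pending.append(child)
        obj.slots.clear()
    return sorted(reclaimed)


holder, a, b = allocate("holder"), allocate("a"), allocate("b")
heap_write(holder, "x", a)
heap_write(a, "y", b)
heap_write(holder, "x", None)      # a's heap count drops to zero
first = collect([holder, a])       # a is still held by a local: nothing freed
second = collect([holder])         # a is unreachable now, and so is b
```

The first call reclaims nothing because a local reference still protects `a`;
only the second call, with that local gone, frees `a` and then `b`.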

Reference counting seemed to have an intrinsic problem with multithreading,
implying that a semaphore must be used for each pointer update. These problems
were dealt with in a series of papers [19,20,3,17]. The sliding views
algorithm of Levanoni and Petrank presented a reference counting collector
that completely eliminated the need for a synchronization operation in the
write barrier. In this work, we use the sliding views algorithms as the basic
building blocks for the generational algorithms. A detailed description of the
Levanoni-Petrank collectors follows.



3 The Levanoni-Petrank Collectors 

In this section we provide a short overview of the Levanoni-Petrank collectors.
Due to space limitations we omit the pseudo code. More details appear in our
technical report [1]. The full algorithm is described in the original paper by
Levanoni and Petrank.

3.1 The Sliding-View Reference Counting Algorithm

The Levanoni-Petrank collectors are based on computing differences between
heap snapshots. The algorithms operate in cycles. A cycle begins with a
collection and ends with another. Let us describe the collector actions during
cycle k. Using a write barrier, the mutators record all heap objects whose
pointer slots are modified during cycle k. The recorded information is the
address of the modified object as well as the values of the object’s pointer
slots before the current modification. A dirty flag is used to let only one
record be kept for any modified object. The analysis shows that (infrequent)
races may cause more than one record to be created for an object, but all such
records contain essentially the same information. The records are written into
a local buffer with no synchronization. The dirty flag is actually implemented
as a pointer, o.LogPointer, which is either null when the flag is clear or
points to the logging location in the local buffer when the flag is set.
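
A minimal sketch of the logging write barrier described above follows. It is
single-threaded and omits the handling of races between mutators and the
collector; the names `Obj`, `Mutator`, `update` and `log_pointer` are
hypothetical and chosen only for illustration.

```python
class Obj:
    def __init__(self, n_slots):
        self.slots = [None] * n_slots
        self.log_pointer = None  # None means clean; otherwise the logged record


class Mutator:
    def __init__(self):
        self.buffer = []  # thread-local log, written without synchronization

    def update(self, obj, index, new):
        """Pointer store with the logging write barrier."""
        if obj.log_pointer is None:  # first modification in this cycle
            # Record the object and its slot values before the modification.
            record = (obj, tuple(obj.slots))
            self.buffer.append(record)
            obj.log_pointer = record  # setting the pointer sets the dirty flag
        obj.slots[index] = new


m = Mutator()
o, x, y = Obj(2), Obj(0), Obj(0)
m.update(o, 0, x)
m.update(o, 1, y)
m.update(o, 0, y)  # object is already dirty: no second record
```

However many times `o` is modified during the cycle, the buffer holds a
single record with the slot values that `o` had before its first
modification.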

All created objects are marked dirty during creation. There is no need to 
record their slots values as they are all null at creation time (and thus, also 
during the previous collection). But objects that will be referenced by these 
slots during the next collection must be noted and their reference counts must 
be incremented. 

A collection begins by taking a sliding view of the heap. A sliding view is
essentially a non-atomic snapshot of the heap. It is obtained incrementally,
i.e., the mutators are not stopped simultaneously. A snooping mechanism is
used to ensure that the sliding view of the heap does not confuse the
collector into reclaiming live objects: while the view is being read from the
heap, the write barrier marks any object that is assigned a new reference in
the heap. These objects are marked as snooped by recording them in the
thread’s local buffer Snooped_i, thus preventing them from being mistakenly
collected in this collection cycle.
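
The snooping idea can be sketched as follows, under strong simplifying
assumptions: a single thread, objects represented by plain strings, and a
heap represented by a dictionary. The class `Thread`, its `snoop` flag and
`snooped` list mirror the Snoop_i flag and Snooped_i buffer of the text but
are otherwise hypothetical.

```python
class Thread:
    def __init__(self):
        self.snoop = False  # raised by the collector while it takes the view
        self.snooped = []   # plays the role of the thread-local Snooped_i buffer

    def heap_store(self, heap, slot, new):
        """Heap pointer store; records the new target while snooping is on."""
        heap[slot] = new
        if self.snoop and new is not None:
            self.snooped.append(new)


def protected_objects(threads):
    """Objects that must not be reclaimed in the current cycle."""
    protected = set()
    for thread in threads:
        protected.update(thread.snooped)
    return protected


heap = {}
t = Thread()
t.heap_store(heap, "a", "obj1")  # before the sliding view: not snooped
t.snoop = True                   # collector starts computing the view
t.heap_store(heap, "b", "obj2")  # stored during the view: snooped
protected = protected_objects([t])
```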

Getting further into the details, the Levanoni-Petrank collector employs four
handshakes during the collection cycle. The collection starts with the
collector raising the Snoop_i flag of each thread, signaling to the mutators
that it is about to start computing a sliding view. During the first
handshake, mutator local buffers are retrieved and then cleared. The objects
listed in the buffers are exactly those objects that have been changed since
the last cycle. Next, the dirty flags of the objects listed in the buffers are
cleared while the mutators are running. This step may clear dirty marks that
have been concurrently set by the running mutators. The logging in the
threads’ local buffers is used during the second handshake to reinforce (set
again) these dirty flags. The third handshake is carried out to ensure that
the reinforcement is visible to all mutators. During the fourth handshake,
thread local states are scanned and objects directly reachable from the roots
are marked as Roots.

After the fourth handshake the collector proceeds to adjust rc fields due to
differences between the sliding views of the previous and current cycle. Each
object that is logged in one of the mutators’ local buffers was modified since
the previous collection cycle; thus we need to decrement the rc of its slot
values in the previous sliding view and increment the rc of its slot values in
the current sliding view. The rc decrement operation for each modified object
is done using the object’s replica in the retrieved local buffers. Each object
replica contains the object’s slot values at the time the previous sliding
view was taken. The rc increment operation for each modified object is more
complicated, as the mutators can change the current sliding-view values of the
object’s slots while the collector tries to increment their rc fields. This
race is solved by taking a replica of the object to be adjusted and committing
it. First, we check if the object’s dirty flag, o.LogPointer, is set. If it is
set, it already points to a committed replica (taken by some mutator) of the
object’s slots at the time the current sliding view was taken. Otherwise, we
take a temporary replica of the object and commit it by checking afterwards
that the object’s dirty flag is still not set. If it is committed, the replica
contains the object’s slot values at the time the current sliding view was
taken and can be used to increment the rc of the object’s slot values.
Otherwise, if the dirty flag is set, we use the replica pointed to by the set
dirty flag in order to adjust the rc of the object’s slots. A collection cycle
ends with reclamation, which recursively frees any object with a zero rc field
that is not marked as local.
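
The replica-and-commit step for rc adjustment can be sketched as below. This
is a sequential illustration only: in the real collector the second check of
the dirty flag matters because mutators run concurrently, whereas here it
always succeeds. The names `Obj`, `current_view_slots` and `adjust` are
assumptions made for this example, and `log_pointer` is simplified to hold the
logged slot values directly.

```python
class Obj:
    def __init__(self, name, slots=()):
        self.name = name
        self.slots = list(slots)
        self.rc = 0
        self.log_pointer = None  # dirty flag; when set, holds the logged slot values


def current_view_slots(obj):
    """Return obj's slot values as of the current sliding view."""
    if obj.log_pointer is not None:
        return obj.log_pointer      # replica already committed by a mutator
    replica = tuple(obj.slots)      # temporary replica
    if obj.log_pointer is None:     # still clean: the replica is committed
        return replica
    return obj.log_pointer          # a mutator logged the object meanwhile


def adjust(obj, previous_slots):
    """Apply the rc differences between the previous and current views."""
    for old in previous_slots:
        if old is not None:
            old.rc -= 1
    for new in current_view_slots(obj):
        if new is not None:
            new.rc += 1


a, b = Obj("a"), Obj("b")
a.rc = 1                                  # referenced by holder in the previous view
holder = Obj("holder", [b])               # now references b instead
adjust(holder, previous_slots=(a,))

racing = Obj("racing", [a])               # a mutator already overwrote the slot...
racing.log_pointer = (b,)                 # ...after logging the view value b
adjust(racing, previous_slots=())
```

For `racing`, the increment goes to `b`, the value recorded in the committed
replica, and not to `a`, the value currently stored in the slot.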



3.2 The Sliding-View Tracing Algorithm 

“Snapshot at the beginning” mark-and-sweep collectors exploit the fact that a
garbage object remains garbage until the collector recycles it, i.e., being
garbage is a stable property. The Levanoni-Petrank sliding-view tracing
collector takes the idea of “Snapshot at the beginning” one logical step
further and shows how it is possible to trace and sweep given a “sliding view
at the beginning”. The collector computes a sliding view exactly as in the
previous reference counting algorithm. After the Mark-Roots stage, the
collector starts tracing according to the sliding view associated with the
cycle. When it needs to trace through an object, the collector tries to
determine its value in the sliding view as was done in the previous algorithm,
i.e., by checking if the object’s LogPointer (the dirty flag) is set. If it is
set, the sliding-view value of each of the object’s slots can be found
directly from the already committed (by some mutator) replica which is pointed
to by the object’s LogPointer. If it is not set, a temporary replica of the
object is taken and is committed by checking again that the object’s dirty
flag is still not set. If the replica is committed, the collector continues by
tracing through the object’s replica. Finally, the collector proceeds to
reclaim garbage objects by sweeping the heap.

4 The Generational Collectors 

In this section we describe the collectors we have designed. Due to lack of
space, we concentrate on the winning collector. The description of the other
two collectors appears in our technical report [1].

Our generational mechanism is simple. The young generation holds all objects
allocated since the previous collection, and each object that survives a young
(or full) collection is immediately promoted to the old generation. This naive
promotion policy fits nicely into the algorithms we use. Recall that
generations are not segregated in the heap since we do not move objects in the
heap. In order to quickly determine if an object is young or old, we keep a
bitmap (1 bit for each 8 bytes) telling which objects are old. All objects are
created young and promotion modifies this bit. From the experience of Domani
et al. [15] we believe that spending more collection effort on an aging
mechanism does not pay. See [15] for more details of this experience.
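
The age bitmap can be pictured with the following sketch, which keeps one bit
per 8 bytes of heap and treats object addresses as plain integers. The class
`AgeBitmap` and its methods are hypothetical; only the granularity of one bit
per 8 bytes comes from the text.

```python
class AgeBitmap:
    GRANULE = 8  # bytes of heap covered by one bit

    def __init__(self, heap_bytes):
        granules = heap_bytes // self.GRANULE
        self.bits = bytearray((granules + 7) // 8)  # all zero: everything young

    def _locate(self, address):
        index = address // self.GRANULE
        return index >> 3, 1 << (index & 7)  # byte offset and bit mask

    def is_old(self, address):
        byte, mask = self._locate(address)
        return bool(self.bits[byte] & mask)

    def promote(self, address):
        byte, mask = self._locate(address)
        self.bits[byte] |= mask


bitmap = AgeBitmap(heap_bytes=1024)
bitmap.promote(64)  # the object starting at address 64 survived a collection
```

With this layout a 1024-byte heap needs a 16-byte bitmap, and testing whether
an object is old takes one shift, one mask and one load.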



4.1 Reference Counting for the Full Collection

Here, we describe the algorithm that worked best: using reference counting for
the full collections and tracing (mark-and-sweep) for the minor collections.

The minor (mark and sweep) collection. The mark and sweep minor collection
marks all reachable young objects at the current sliding view and then sweeps
all unmarked young objects. The young generation contains all the objects that
were created since the previous collection cycle and were logged by the i-th
mutator to its local Young-Objects_i buffer. These local buffers hold the
addresses of all objects created since the most recent collection and can also
be viewed as holding pointers to all objects in the young generation to be
processed by the next collection. These buffers are retrieved by the collector
in the first handshake of the collection, and their union, the Young-Objects
buffer of the collector, is the young generation to be processed (swept) in
this minor collection cycle.

Recall that we are using the Levanoni-Petrank sliding view collectors as the
basis for this work. The sliding view algorithm uses a dirty flag for each
object to tell if it was modified since the previous collection. All modified
objects are kept in an Updates buffer (which is essentially the union of all
mutators’ local buffers) so that the rc fields of objects referenced by these
objects’ slots can later be updated by the collector. Since we are using the
naive promotion policy, we may use these buffers also as our remembered set:
the young generation contains only objects that have been created since the
last collection; thus it follows that inter-generational pointers may only be
located in pointer slots that have been modified since the last collection.
Clearly, objects in the old generation that point to young objects must have
been modified since the last collection cycle, since the young objects did not
exist at the time of the last collection. Thus, the addresses of all the
inter-generational pointers must appear in the Updates buffer of the collector
at this collection cycle. At first glance it may appear that this is enough.
However, the collection cycle is not atomic in the view of the program. It
runs concurrently with the run of the program. Thus, referring to the time of
the last collection cycle is not accurate. During the following discussion, we
assume that the reader is familiar with the original Levanoni-Petrank
collectors. There are two cases in which inter-generational pointers are
created but do not appear in the Updates buffer read by the collector in the
first handshake.



Case 1: Mutator Mj creates a new object O after responding to the first
handshake. Later, mutator Mi, which has not yet seen the first handshake,
executes an update operation assigning a pointer in the old generation to
reference the object O. In this case, an inter-generational pointer is
created: the object O was not reported to the collector in the first handshake
and thus will not be reclaimed or promoted in the current collection. It will
be reported as a young object to the collector only in the next collection.
But the update is recorded in the current collection (the update was executed
before the first handshake in the view of mutator Mi) and will not be seen in
the next collection. Thus, an inter-generational pointer will be missing from
the view of the next collection.

Case 2: Some mutator updates a pointer slot in an object O to reference a
young object. The object O is currently dirty because of the previous
collection cycle, i.e., the first handshake has occurred, but the clear dirty
flags operation has not yet been executed for that object. In this case, an
inter-generational pointer is created but it is not logged to the mutator’s
local Updates buffer. Indeed, this pointer slot must appear in the Updates
buffer of the previous collection and the correctness of the original
algorithm is not foiled, yet in the next cycle the Updates buffer might not
contain this pointer; thus an inter-generational pointer may be missing from
the view of the next collection.

In order to correctly identify inter-generational pointers that are created in
one of the above two manners, each minor collection records into a special
buffer, called IGP-Buffer, the addresses of all objects involved in updates
that reference young objects during the uncertainty period, which lasts from
before the first handshake begins until after the clear dirty flags operation
is over for all the modified (logged) objects. The next collection cycle uses
the IGP-Buffer that was filled in the previous collection cycle as its
PrevIGP-Buffer in order to scan the potential inter-generational pointers that
might not have appeared in the Updates buffer. In this way, we are sure to
have all inter-generational pointers covered for each minor collection.

Finally, we note that the sweep phase processes only young objects. It scans
the color of each object in the Young-Buffer. Objects that are marked white
are reclaimed; the others are promoted by setting their old flag to true.
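
Putting the pieces of the minor collection together, a much simplified
stop-the-world sketch might look as follows. It ignores concurrency,
handshakes and the sliding view, and simply treats the Updates buffer and the
PrevIGP-Buffer as lists of old objects whose slots may reference young
objects. The function `minor_collection` and the fields of `Obj` are
hypothetical names.

```python
class Obj:
    def __init__(self, name, old=False):
        self.name = name
        self.slots = []
        self.old = old
        self.marked = False


def minor_collection(young_objects, updates, prev_igp, roots):
    """Mark young objects reachable from roots and remembered old objects, then sweep."""
    stack = [o for o in roots if not o.old]
    # Old objects that may contain inter-generational pointers.
    for holder in list(updates) + list(prev_igp):
        if holder.old:
            stack.extend(c for c in holder.slots if c is not None and not c.old)
    while stack:  # trace through young objects only
        obj = stack.pop()
        if obj.marked:
            continue
        obj.marked = True
        stack.extend(c for c in obj.slots if c is not None and not c.old)
    reclaimed = []
    for obj in young_objects:  # the sweep processes only young objects
        if obj.marked:
            obj.old = True     # naive promotion: every survivor becomes old
            obj.marked = False
        else:
            reclaimed.append(obj.name)
    return reclaimed


old_holder = Obj("old_holder", old=True)
y1, y2, y3, y4 = Obj("y1"), Obj("y2"), Obj("y3"), Obj("y4")
old_holder.slots = [y1]  # inter-generational pointer, so old_holder is in Updates
y1.slots = [y2]
reclaimed = minor_collection([y1, y2, y3, y4], [old_holder], [], roots=[y4])
```

Only `y3`, which is reachable neither from a root nor from a remembered old
object, is reclaimed; the three survivors are promoted.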

The full (reference counting) collection. The Major-Young-Objects and
Major-Updates buffers are full collection buffers that correspond to the minor
collection’s Young-Objects and Updates buffers. These buffers are prepared by
the minor collections to serve the full collection. Only those objects which
are promoted by the minor collections should be logged to the major buffers,
as these objects will live until the next full collection. The minor
collection avoids repetition in these buffers using an additional bitmap
called LoggedToMajorBuffers. Other than the special care required with the
buffers, the major collection cycle is similar to the original reference
counting collector. The rc field adjustments are executed for each modified
object, that is, each object logged to the Major-Updates buffer or the Updates
buffer. The rc of the object’s previous sliding-view slot values is
decremented and the rc of the object’s current sliding-view slot values is
incremented. As for young objects, the same procedure needs only to increment
the rc fields of the current sliding-view slot values for each young object,
that is, each object logged to Major-Young-Objects or Young-Objects. No
decrement operation should be applied to the rc fields of the slot values of
Young-Objects objects, because these objects did not exist in the previous
collection cycle; they were created only afterwards, and their slot values at
that time were null.

Using deferred reference counting (following [10]), we employ a zero count
table, denoted ZCT, to hold each young object whose count decreases to zero
during the counter updates. All these candidates are checked after all the
updates are done. If their reference count is still zero and they are not
referenced from the roots, then they may be reclaimed. Note that all newly
created (young) objects must be checked since they are created with reference
count zero. (They are only referenced by local variables in the beginning.)
Thus, all objects in the Young-Objects buffer as well as in the
Major-New-Objects buffer are appended to the ZCT that is reclaimed by the
collector.
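
The counter updates and the ZCT check of the full collection can be sketched
sequentially as below. The sketch merges the major and minor buffers into two
plain arguments and ignores sliding views and cycle collection; the function
`full_collection` and its parameters are assumptions made for this example.

```python
class Obj:
    def __init__(self, name):
        self.name = name
        self.slots = []
        self.rc = 0


def full_collection(updates, young_objects, roots):
    """updates: (object, previous slot values) records; returns reclaimed names."""
    zct = list(young_objects)  # young objects are created with rc == 0
    for obj, previous_slots in updates:
        for old in previous_slots:  # values at the previous view
            if old is not None:
                old.rc -= 1
                if old.rc == 0:
                    zct.append(old)
        for new in obj.slots:       # values at the current view
            if new is not None:
                new.rc += 1
    for obj in young_objects:       # increments only: previous values were null
        for new in obj.slots:
            if new is not None:
                new.rc += 1
    protected = set(roots)
    reclaimed = set()
    while zct:  # candidates are checked after all updates are done
        obj = zct.pop()
        if obj.rc != 0 or obj in protected or obj in reclaimed:
            continue
        reclaimed.add(obj)
        for child in obj.slots:
            if child is not None:
                child.rc -= 1
                if child.rc == 0:
                    zct.append(child)
        obj.slots = []
    return sorted(o.name for o in reclaimed)


h, a = Obj("h"), Obj("a")
h.rc, a.rc = 1, 1  # counts carried over from the previous cycle
n1, n2, n3, n4 = Obj("n1"), Obj("n2"), Obj("n3"), Obj("n4")
h.slots = [n1]     # h used to reference a and now references n1
n1.slots = [n2]
result = full_collection([(h, [a])], [n1, n2, n3, n4], roots=[n4])
```

Here `a` loses its only reference and `n3` was never referenced, so both are
reclaimed, while `n4` survives with a zero count because a root references it.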

The inability of reference counting algorithms to reclaim cyclic structures is
handled by an auxiliary mark-and-sweep algorithm used infrequently during the
full collection.

5 Implementation and Results 

We have implemented our collectors in Jikes. We have decided to use the
non-copying allocator of Jikes, which is based on the allocator of Boehm,
Demers and Shenker. This allocator is suitable for collectors that do not move
objects. It keeps the fragmentation low and allows both efficient sporadic
reclamation of objects (as required by reference counting) and efficient
linear reclamation of objects (as required by the sweep procedure). A full
heap collection is triggered when the amount of available memory drops below a
predefined threshold. A minor heap collection is triggered after every 200 new
allocator-block allocations. This kind of triggering strategy emulates
allocations from a young generation whose size is limited.
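
The triggering strategy can be summarized by a small sketch. The value 200
comes from the text; the class `CollectionTrigger`, the parameter
`full_threshold_bytes`, and the byte values used below are hypothetical.

```python
MINOR_TRIGGER_BLOCKS = 200  # minor collection after this many new blocks


class CollectionTrigger:
    def __init__(self, full_threshold_bytes):
        self.full_threshold_bytes = full_threshold_bytes  # assumed parameter
        self.blocks_since_collection = 0

    def on_block_allocated(self, free_bytes):
        """Return 'full', 'minor' or None after a new allocator block is handed out."""
        self.blocks_since_collection += 1
        if free_bytes < self.full_threshold_bytes:
            self.blocks_since_collection = 0
            return "full"
        if self.blocks_since_collection >= MINOR_TRIGGER_BLOCKS:
            self.blocks_since_collection = 0
            return "minor"
        return None


trigger = CollectionTrigger(full_threshold_bytes=1 << 20)
decisions = [trigger.on_block_allocated(free_bytes=8 << 20) for _ in range(400)]
low_memory = trigger.on_block_allocated(free_bytes=512 << 10)
```

Counting allocator blocks rather than objects bounds the amount of memory
allocated between minor collections, which is how the strategy emulates a
young generation of limited size.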

We have taken measurements on a 4-way IBM Netfinity 8500R server with 550MHz
Intel Pentium III Xeon processors and 2GB of physical memory. We also measured
the run of our collector on a client machine: a single 550MHz Intel Pentium
III processor and 2GB of physical memory. The benchmarks we used were the
SPECjvm98 benchmark suite and the SPECjbb2000 benchmark. These benchmarks are
described in detail on SPEC’s Web site.

The Jikes concurrent collector. Our collectors’ measurements are compared with
the concurrent reference counting collector supplied with the Jikes package
and reported in [3]. The Jikes concurrent collector is an advanced on-the-fly
pure reference-counting collector and it has characteristics similar to those
of our collectors, namely, the mutators are loosely synchronized with the
collector, allowing very low pause times.

Testing procedure. We ran the benchmark suite using the test harness,
performing standard automated runs of all the benchmarks in the suite. Our
standard automated run runs each benchmark five times for each of the JVMs
involved (each implementing a different collector). To get an additional
multithreaded benchmark, we have also modified the _227_mtrt benchmark from
the SPECjvm98 suite to run on a varying number of threads. We measured its run
with 2, 4, 6, 8 and 10 threads. Finally, to better understand the behavior of
these collectors under tight and relaxed conditions, we tested them on varying
heap sizes. For the SPECjvm98 suite, we started with a 32MB heap size and
extended the sizes by 8MB increments until a final large size of 96MB. For
SPECjbb2000 we used larger heaps, as reported in the graphs. In the results,
we concentrate on the best collector, i.e., the collector that uses reference
counting for the full collection. In our technical report [1] we provide
measurements for all collectors. Also, in our technical report, we provide a
systematic report on how we selected our parameters, such as the triggering
policy, the allocator parameters, the size of the young generation, etc. We
omit these reports from this short paper for lack of space.

Fig. 1. Max pause time measurements for SPECjvm98 and SPECjbb2000 benchmarks
on a multiprocessor. SPECjbb2000 was measured with 1, 2, and 3 warehouses.



[Figure 2 panels: SPECjvm98 - Multiprocessor (left); SPECjvm98 - Uniprocessor (right)]

Fig. 2. Running time ratios (Jikes-Concurrent/Generational) for the SPECjvm98
suite with varying heap sizes. The graph on the left shows results on a
multiprocessor and the graph on the right reports results for a uniprocessor.

Server measurements. The SPECjvm98 benchmarks (and so also the 
_227_mtrt modified benchmark) provide a measure of the elapsed running time 
which we report. We report in figure [3] the running time ratio of our collector 
[Figure panels: SPECjbb00 - Throughput Ratio (left), _227_mtrt - Multiprocessor (right)]

Fig. 3. The graph on the left shows SPECjbb2000 throughput ratios 
(Generational/Jikes-Concurrent) on a multiprocessor and the graph on the right 
reports running time ratio (Jikes-Concurrent/Generational) for the _227_mtrt 
benchmarks on a multiprocessor. 

and the Jikes concurrent collector. The higher the number, the better our col- 
lector performs. In particular, a value above 1 means our collector outperforms 
the Jikes concurrent collector. 

We ran each of the SPECjvm98 benchmarks on a multiprocessor, allowing a 
designated processor to run the collector thread. We report these results in figure 
2 (graph on left). These results demonstrate performance when the system is not 
busy and the collector may run concurrently on an idle processor. In practically 
all measurements, our collector did better than the Jikes concurrent collector, 
up to an improvement of 48% for _202_jess on small heaps. The behavior of 
the collector on a busy system may be tested when the number of application 
threads exceeds the number of (physical) processors. A special case is when the 
JVM is run on a uniprocessor. In these cases, the efficiency of the collector is 
important: the throughput may be harmed when the collector spends too much 
CPU time. We have modified the _227_mtrt benchmark to work with a varying 
number of threads (4, 6, 8, 10 threads) and the resulting running time measures 
are reported in the right graph of figure 3. The measurements show an improved 
performance for almost all parameters with typical to large heaps, with the 
highest improvement being 30% for _227_mtrt with 6 threads and heap size 
96MBytes. However, on small heaps the Jikes concurrent collector does better. 

The results of SPECjbb2000 are measured a bit differently. The run of 
SPECjbb2000 requires a multi-phased run with an increasing number of threads. 
Each phase lasts for two minutes with a ramp-up period of half a minute before 
each phase. Again, we report the throughput ratio improvement. Here the result 
[Figure panels: SPECjvm98 - Multiprocessor (left), SPECjbb00 - Multiprocessor (right)]

Fig. 4. The results of the second generational algorithm which uses reference counting 
for the minor generation. The graph on the left shows SPECjvm98 running time ra- 
tios (Jikes-Concurrent/Generational) on a multiprocessor and the graph on the right 
reports throughput ratio (Generational/Jikes-Concurrent) for SPECjbb2000 on a mul- 
tiprocessor. 

is throughput and not running-time. For clarity of representation, we report the 
inverse ratio, so that higher ratios still show better performance of our collector, 
and ratios larger than 1 imply our collector outperforming the Jikes concurrent 
collector. The measurements are reported for a varying number of threads (and 
varying heap sizes) in the left graph of Figure 3. When the system has no idle 
processor for the collector (4, 6, and 8 warehouses), our collector clearly outper- 
forms the Jikes concurrent collector. The typical improvement is 25% and the 
highest improvement is 45%. In the case in which 2 warehouses are run and the 
collector is free to run on an idle processor, our collector performs better when 
the heap is not tight, whereas on tighter heaps, the Jikes concurrent collector 
wins. 
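
The two ratios used in these graphs can be sketched as follows. This is a minimal Python illustration only: the function names are hypothetical, and the sample inputs in the usage note are made-up numbers, not measurements from the paper.

```python
def running_time_ratio(jikes_concurrent_time: float, generational_time: float) -> float:
    """Jikes-Concurrent / Generational running time.

    A value above 1 means the generational collector finished faster.
    """
    if jikes_concurrent_time <= 0 or generational_time <= 0:
        raise ValueError("running times must be positive")
    return jikes_concurrent_time / generational_time


def throughput_ratio(generational_throughput: float, jikes_concurrent_throughput: float) -> float:
    """Generational / Jikes-Concurrent throughput (the inverse orientation).

    Throughput is better when larger, so the operands are swapped relative to
    the running-time ratio; a value above 1 still favors the generational collector.
    """
    if generational_throughput <= 0 or jikes_concurrent_throughput <= 0:
        raise ValueError("throughputs must be positive")
    return generational_throughput / jikes_concurrent_throughput
```

For example, with hypothetical running times of 12.0 s and 10.0 s the running-time ratio is 1.2, and with hypothetical throughputs of 500 and 400 operations per second the throughput ratio is 1.25; both are above 1 and therefore both favor the generational collector.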

The maximum pause times for the SPECjvm98 benchmarks and the 
SPECjbb2000 benchmark are reported in figure 1. The SPECjvm98 bench- 
marks were run with heap size 64MBytes and those of SPECjbb2000 (with 1, 2, 3 
threads) with heap size 256MBytes. Note that if the number of threads exceeds 
the number of processors, then long pause times appear because threads lose 
the CPU to other mutators or the collector. Hence the reported settings. It can 
be seen that the maximum pause times (see figure 1) are as low as those of the 
Jikes concurrent collector and they are all below 5ms. 

We go on with a couple of graphs presenting measurements of the second 
best collector: the one that runs reference counting for the young generation and 
mark and sweep for the full collection. In figure 4 we report the running time and 
throughput ratio of this collector. As seen from these graphs this collector does 
not perform significantly worse. In most measurements, it did better than the 
Jikes concurrent collector, up to an improvement of 50% for _202_jess on small 
heaps and 25% for the SPECjbb2000 benchmark with 8 threads. More 
measurements appear in our technical report. 

Client measurements. Finally, we have also measured our generational col- 
lector on a uniprocessor to check how it handles a client environment with the 
SPECjvm98 benchmark suite. We report the uniprocessor tests in figure 2 
(graph on right). It turns out that the generational algorithm is better than 
the Jikes concurrent collector in almost all tests. Note the large improvement of 
around 60% for the _202_jess benchmark. 

6 Conclusions 

We have presented a design for integrating generations with an on-the-fly ref- 
erence counting collector: using reference counting for the full collection and 
mark and sweep for collecting the young generation. A tracing collector is infre- 
quently used to collect cyclic garbage structures. We used the Levanoni-Petrank 
sliding view collectors as the building blocks for this design. The collector was 
implemented on Jikes and was run on a 4-way IBM Netfinity server. 

Our measurements against the Jikes concurrent collector show a large im- 
provement in throughput and the same low pause times. The collector presented 
here is the best among the three possible incorporations of generations into ref- 
erence counting collectors. 
Abstract. The express-lane transformation isolates and duplicates frequently ex- 
ecuted program paths, aiming for better data-flow facts along the duplicated paths. 
An express-lane p is a copy of a frequently executed program path such that p has 
only one entry point at its beginning; p may have branches back to the original 
code, but the original code never branches into p. Classical data-flow analysis is 
likely to find sharper data-flow facts along an express-lane, because there are no 
join points. 

This paper describes several variants of interprocedural express-lane transforma- 
tions; these duplicate hot interprocedural paths, i.e., paths that may cross procedure 
boundaries. The paper also reports results from an experimental study of the effects 
of the express-lane transformation on interprocedural range analysis. 



1 Introduction 

In path profiling, a program is instrumented with code that counts the number of times 
particular finite-length path fragments of the program’s control-flow graph — or observ- 
able paths — are executed. One application of path profiling is to transform the profiled 
program by isolating and optimizing frequently executed, or hot, paths. We call this 
transformation the express-lane transformation. An express-lane p is a copy of a hot 
path such that p has only one entry point at its beginning; p may have branches back 
to the original code, but the original code never branches into p. Classical data-flow 
analysis is likely to find sharper data-flow facts along the express lanes, since there are 
no join points. This may create opportunities for program optimization. 
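
The effect of join points on data-flow facts can be illustrated with a small interval (range) analysis. The following is a minimal Python sketch under stated assumptions: the `Interval` representation, the helper names, and the example values are all hypothetical and are not taken from the paper's implementation.

```python
from typing import Optional, Tuple

Interval = Tuple[int, int]  # closed range [lo, hi] of possible values


def join(a: Interval, b: Interval) -> Interval:
    """Merge two range facts at a control-flow join point (smallest enclosing range)."""
    return (min(a[0], b[0]), max(a[1], b[1]))


def branch_outcome(x: Interval, bound: int) -> Optional[bool]:
    """Decide the branch 'x < bound' from the range of x.

    Returns True or False when the range resolves the branch, None otherwise.
    """
    lo, hi = x
    if hi < bound:
        return True
    if lo >= bound:
        return False
    return None


# Two paths reach the same program point with different facts about x.
x_on_hot_path: Interval = (0, 0)
x_on_cold_path: Interval = (10, 10)

# Original code: both paths merge, so the analysis only knows x is in [0, 10].
at_join = join(x_on_hot_path, x_on_cold_path)
resolved_at_join = branch_outcome(at_join, 5)

# Express-lane: the hot path is duplicated, so its copy has no join point
# and keeps the sharper fact x in [0, 0].
resolved_on_express_lane = branch_outcome(x_on_hot_path, 5)
```

In this sketch the branch `x < 5` cannot be resolved after the join but is resolved along the duplicated path, which is the kind of opportunity the transformation aims to expose.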

We use the interprocedural express-lane transformation together with range analysis 
to perform program optimization. Our approach differs from the literature on profile- 
driven optimization in one or more of the following aspects: 

1. We duplicate interprocedural paths. This may expose correlations between branches 
in different procedures, which can lead to more optimization opportunities Q. 

2. We perform code transformation before performing data-flow analysis. This allows 
us to use classic data-flow analyses. 

3. We guide path duplication using interprocedural path profiles. This point may sound 
redundant, but 0 , for example, uses edge profiles to duplicate intraprocedural paths. 
The advantage of using interprocedural path profiles is that we get more accuracy 
in terms of which paths are important. 

4. We perform interprocedural range analysis on the transformed graph. 

5. We attempt to eliminate duplicated code when there was no benefit to range analysis. 
This can help eliminate code growth. 

G. Hedin (Ed.): CC 2003, LNCS 2622, pp. 200-1^^ 2003. 
This paper describes algorithms and presents experimental results for the approach to 
profile-driven optimization described above. Specifically, our work makes the following 
contributions: 

1. El provides an elegant solution for duplicating intraprocedural paths based on an 
intraprocedural path profile; this paper generalizes that work by providing algo- 
rithms that take a program supergraph (an interprocedural control-flow graph) and 
an interprocedural path profile and produce an express-lane supergraph. 

2. We show that interprocedural express-lane transformations yield benefits for range 
analysis: programs optimized using an interprocedural express-lane transformation 
and range analysis resolve (a) 0-7% more dynamic branches than programs opti- 
mized using the intraprocedural express-lane transformation and range analysis, and 
(b) 1.5-19% more dynamic branches than programs optimized using range analysis 
alone. 

3. We show that by using range analysis instead of constant propagation, the intrapro- 
cedural express-lane transformation can lead to greater benefit than previously re- 
ported. We also show that code growth due to the intraprocedural express-lane 
transformation is not always detrimental to program performance. 

4. Our experiments show that optimization based on an interprocedural express-lane 
transformation does benefit performance, though usually not enough to overcome 
the costs of the transformation. These results suggest that software and/or hardware 
support for entry and exit splitting may be a profitable research direction; entry and 
exit splitting are described in Section 3.1. 

The remainder of the paper is organized as follows: Section 2 describes the relevant 
details of the interprocedural path-profiling techniques. Section 3 describes the interpro- 
cedural express-lane transformations. Section 4 presents experimental results. Section 5 
describes related work. 

2 Path Profiling Overview 

To understand the interprocedural express-lane transformation, it is helpful to understand 
the interprocedural paths that are duplicated. This section summarizes the relevant parts 
of [10] and [11]. In these works, the Ball-Larus technique [4] is extended in several 
directions: 

1. Interprocedural vs. Intraprocedural: [10] presents interprocedural path-profiling 
techniques in which the observable paths can cross procedure boundaries. Interpro- 
cedural paths tend to be longer and to capture correlations between the execution 
behavior of different procedures. 

2. Context vs. Piecewise: In piecewise path profiling, each observable path corre- 
sponds to a path that may occur as a subpath (or piece) of an execution sequence. 
In context path profiling, each observable path corresponds to a pair (C, p), with an 
active-suffix p that corresponds to a subpath of an execution sequence, and a context- 
prefix C that corresponds to a context (e.g., a sequence of pending calls) in which p 
may occur. A context path-profiling technique generally has longer observable paths 
and maintains finer distinctions than a piecewise technique. 
In this paper, we use three kinds of path profiles: Ball-Larus path profiles (i.e., intraproce- 
dural piecewise path profiles) and the interprocedural piecewise and context path profiles 
of IHQIllH . (Our techniques could be applied to other types of path profiles.) 

Interprocedural path profiling works with an interprocedural control-flow graph 
called a supergraph. A program’s supergraph G* consists of a unique entry vertex 
Entry_global, a unique exit vertex Exit_global, and a collection of control-flow graphs. 

The flowgraph for procedure P has a unique entry vertex, Entry_P, and a unique exit 
vertex, Exit_P. The other vertices of the flowgraph represent statements and predicates 
in the usual way, except that each procedure call in the program is represented by a call 
vertex and a return-site vertex. For each procedure call to procedure P (represented, say, 
by call vertex c and return-site vertex r), G* contains a call-edge, c → Entry_P, and a 
return-edge, Exit_P → r. The supergraph also contains the edges Entry_global → Entry_main 
and Exit_main → Exit_global. 

As in the Ball-Larus technique, the observable paths 
in the interprocedural path-profiling techniques are not 
allowed to contain backedges. Furthermore, an observ- 
able path cannot contain a call-edge or return-edge from 
a recursive call-site. (Recursive call-sites are those that 
are the source of a backedge in the call graph.) 
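
The supergraph construction described above can be sketched as a small data structure. This is a minimal Python illustration, not the authors' implementation: the `Supergraph` class, the `build_supergraph` helper, the edge-kind strings, and the vertex names in the example are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str, str]  # (source vertex, target vertex, kind)


@dataclass
class Supergraph:
    edges: Set[Edge] = field(default_factory=set)

    def add(self, src: str, dst: str, kind: str = "intra") -> None:
        self.edges.add((src, dst, kind))


def build_supergraph(
    flowgraphs: Dict[str, List[Tuple[str, str]]],
    calls: List[Tuple[str, str, str]],
) -> Supergraph:
    """Combine per-procedure flowgraphs into one supergraph.

    flowgraphs maps a procedure name to its intraprocedural edges; the entry and
    exit vertices of procedure P are assumed to be named 'Entry_P' and 'Exit_P'.
    calls lists (call vertex, return-site vertex, callee name) triples.
    """
    g = Supergraph()
    for edges in flowgraphs.values():
        for u, v in edges:
            g.add(u, v)
    for call_vertex, return_site, callee in calls:
        g.add(call_vertex, f"Entry_{callee}", "call")    # call-edge c -> Entry_P
        g.add(f"Exit_{callee}", return_site, "return")   # return-edge Exit_P -> r
    g.add("Entry_global", "Entry_main")
    g.add("Exit_main", "Exit_global")
    return g


# Hypothetical two-procedure program: main calls foo once.
example = build_supergraph(
    flowgraphs={
        "main": [("Entry_main", "c"), ("r", "Exit_main")],
        "foo": [("Entry_foo", "Exit_foo")],
    },
    calls=[("c", "r", "foo")],
)
```

Keeping the edge kind alongside each edge makes it easy to tell call-edges and return-edges apart from intraprocedural edges in later passes.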

An observable path in an interprocedural context path profile may contain surrogate 
edges; surrogate edges are required because observable paths are not allowed to contain 
backedges. Unlike other edges in an observable path, a surrogate edge is not an edge in 
the supergraph. A surrogate edge Entry_P → v in an observable path p represents an 
unknown path fragment q that starts at the entry vertex Entry_P of a procedure P and 
ends with a backedge to vertex v in procedure P. An observable path from an 
interprocedural path profiling technique may also contain summary edges. A summary 
edge connects a call vertex with its return-site vertex. 

Fig. 1. Example of an interprocedural context path. The active-suffix is shown in bold 
and surrogate edges are shown using dashed-lines. 

In the context path-profiling technique, a context-prefix is a sequence of path fragments 
in the supergraph, each fragment connected to the next by a surrogate edge. The 
context-prefix summarizes both the sequence of pending call-sites and some information 
about the path taken to each pending call-site. Fig. 1 shows a schematic of an observable 
path from an interprocedural context path profile. 

Fig. 2 shows the average number of SUIF instructions in an observable path for 
several SPEC95 benchmarks. (For technical reasons discussed in lEl, there are some 
situations where an interprocedural piecewise path is considered to have a context-prefix, 
cf. m88ksim, li, perl, and vortex.) 



Fig. 2. Graph of the average number of SUIF instructions in an observable path for interprocedural 
context, interprocedural piecewise, and intraprocedural piecewise path profiles of SPEC95 bench- 
marks when run on their reference inputs. Each observable path was weighted by its execution 
frequency. 



3 The Interprocedural Express-Lane Transformation 

The intraprocedural express-lane transformation takes a control-flow graph and an in- 
traprocedural, piecewise path profile and creates an express-lane graph 0. In this sec- 
tion, we describe how to extend this algorithm to take as input the program supergraph 
and an interprocedural path profile, and produce as output an express-lane supergraph. 

There are several issues that must be addressed. The definition of an express-lane must 
be extended. In a context path profile, a path may consist of a non-empty context-prefix 
as well as an active-suffix. Also, an observable path may contain “gaps” represented by 
surrogate edges. An express-lane version of an observable path may have a context-prefix 
and an active-suffix, and may have gaps just as the observable path does. 

There are also technical issues that must be resolved. The interprocedural express- 
lane transformation requires a mechanism for duplicating call-edges and return-edges. 
We will use a straightforward approach that duplicates a call edge c → Entry_P by 
creating copies of c and Entry_P and duplicates a return edge Exit_P → r by creating 
copies of Exit_P and r. 

Many modifications of the intraprocedural algorithm are required to obtain an 
algorithm for performing the interprocedural express-lane transformation. The 
Ammons-Larus express-lane transformation uses a hot-path automaton, a deterministic 
finite automaton (DFA) for recognizing hot-paths, and takes the cross product of this 
automaton with the control-flow graph (CFG), which can be seen as another DFA. 
To create an automaton that recognizes a set of interprocedural hot-paths, we require 
a pushdown automaton (PDA). The supergraph can be seen as a second PDA. Thus, if 
we mimic the approach in [3], we would need to combine two pushdown automata, a 
problem that is uncomputable, in general. Instead, we create a collection of deterministic 
finite automata, one for each procedure; the automaton for procedure P recognizes hot- 
paths that start in P. 

3.1 Entry and Exit Splitting 

The algorithm for performing the interprocedural express-lane transformation uses entry 
splitting to duplicate call-edges and exit splitting to duplicate return-edges [5,6]. Entry 
splitting allows a procedure P to have more than one entry. Exit splitting allows a 
procedure P to have multiple exits, each of which is assigned a number. Normally, 
when a procedure call is made, the caller provides a return address. In the case where a 
procedure has multiple exits, the caller provides a vector of return addresses. When the 
callee reaches the exit vertex, it branches to the return address. Our implementation 
uses a semantically equivalent but inferior method of entry (and exit) splitting: each call 
vertex sets an entry number before making a normal procedure call; the called procedure 
(calling procedure) then executes a switch on the entry (exit) number to jump to the proper 
entry (return) point. 
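
The switch-based method described above can be sketched in a few lines. This is a minimal Python illustration of the idea only: the procedure `foo`, its two entries and two exits, and the arithmetic inside it are hypothetical and do not come from the paper's implementation.

```python
from typing import Tuple


def foo(entry_number: int, x: int) -> Tuple[int, int]:
    """Callee with two entries and two numbered exits (hypothetical example)."""
    # The callee switches on the entry number to jump to the proper entry point.
    if entry_number == 0:
        x = x + 1      # code reached only through entry 0
    elif entry_number == 1:
        x = x * 2      # code reached only through entry 1
    else:
        raise ValueError("unknown entry number")
    # The callee reports which of its exits was reached.
    exit_number = 0 if x % 2 == 0 else 1
    return exit_number, x


def caller(x: int) -> str:
    """Caller that selects entry 1 and then dispatches on the exit number."""
    # The call vertex sets an entry number before making a normal procedure call.
    exit_number, result = foo(1, x)
    # The caller switches on the exit number to jump to the proper return point.
    if exit_number == 0:
        return f"return point 0: {result}"
    return f"return point 1: {result}"
```

The extra dispatch on every call and return is the overhead that makes this method inferior to passing a vector of return addresses, which is why hardware or software support for splitting could help.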



3.2 Defining the Interprocedural Express-Lane 






In this section, we give a definition of an interprocedural express-lane. First we consider 
a simple example to develop intuition about what should happen when we duplicate an 
observable path from an interprocedural context path profile. 

Example 1. Consider the supergraph shown in Fig. 3. Suppose we wish to create an 
express-lane version of the observable path 
p = [Entry_main → a → b → d → Entry_foo → F → H → I]. The context-prefix 
[Entry_main → a → b → d → Entry_foo] indicates a path taken in main to the call-site 
on foo. The active-suffix of p is [F → H → I]. The principal difficulty in duplicating p 
has to do with the edge Entry_foo → F: this surrogate-edge appears in the middle of 
the observable path, but does not appear in the supergraph. What does it mean to 
duplicate this edge? 

Peeking ahead, Fig. 9 shows an express-lane graph with an express-lane version 
of p. When we create an express-lane version of p, we create copies of the path’s 
context-prefix and its active-suffix. The copy of the context-prefix ends at a copy 




Fig. 3. Example supergraph. 




The Interprocedural Express-Lane Transformation 205 



[Entry_foo, 4] of vertex Entry_foo. The copy of the active-suffix begins at a copy [F, 8] 
of vertex F. We desire that any time execution reaches [F, 8], it came along a path from 
[Entry_foo, 4]: we want to make sure that the duplicated active-suffix executes in the 
context of the duplicated context-prefix. □ 

We can now give a technical definition of an interprocedural express-lane: let G* be 
a supergraph and let p be an observable path. Let F* be a supergraph where every vertex 
of F* is a copy of a vertex in G*. Then an express-lane version of p is a sequence of 
vertices [a_1, a_2, ..., a_n] in F* such that the following properties are satisfied: 

Duplication property: a_i is a copy of the i-th vertex in p. 

Minimal predecessor property: A vertex a_i may have multiple predecessors if a_i = 
a_1, or the (i-1)-th edge of p is a surrogate edge, or a_i is a copy of a return-site 
vertex; otherwise a_i has exactly one predecessor, which is a_{i-1}. If a_i is a copy of 
return-site vertex r then let c be the call vertex associated with r: 

- If there is a copy of c in [a_1 ... a_{i-1}], then a_i is associated with one call vertex, 
the last copy of c in [a_1 ... a_{i-1}]; otherwise, a_i may be associated with many 
call vertices. 

- If a_{i-1} is a copy of an exit vertex, then a_i is targeted by exactly one return-edge, 
a_{i-1} → a_i. If a_i is a_1 or a_{i-1} is a copy of a call vertex, then a_i may be targeted 
by multiple return-edges. 

Context property: For a vertex a_i in procedure P, if there is a copy of Entry_P in 
[a_1 ... a_i], then a_i can be reached by an intraprocedural path from the last copy of 
Entry_P in [a_1 ... a_i] and not from any other copy of Entry_P. 

These properties sometimes allow a vertex on an express-lane to have multiple predeces- 
sors (i.e., there may be branches into the middle of an express-lane). This is necessary 
because: (1) a surrogate edge u → v does not specify a direct predecessor vertex of v in 
the supergraph; (2) a return-site vertex always has both an intraprocedural predecessor 
(the call site vertex) and an interprocedural predecessor. 



3.3 Performing the Interprocedural Express-Lane Transformation 

We now present two algorithms for performing the interprocedural express-lane trans- 
formation, one for interprocedural piecewise path profiles, and one for interprocedural 
context path profiles. 

Our approach to constructing the express-lane supergraph consists of three phases: 

1. Construct a family A of automata with one automaton for each procedure P. 
The automaton A_P is specified as a DFA that recognizes (prefixes of) hot-paths that 
begin in P. 

2. Use the Interprocedural Hot-path Tracing Algorithm (see below) to combine A with 
the supergraph G* to generate an initial express-lane supergraph. 

3. Make a pass over the generated express-lane supergraph to add return-edges and 
summary-edges where appropriate. This stage finishes connecting the intraproce- 
dural paths created in the previous step. 
The two algorithms for performing the interprocedural express-lane transformation differ 
slightly in the first step. 

The Hot-path Tracing Algorithm treats the automata in A as DFAs, though technically 
they are not: an interprocedural hot path p may contain “gaps” that are represented by 
surrogate- or summary-edges. These gaps may be filled by same-level valid paths, or 
SLVPs; an SLVP is a path in which every return-edge can be matched with a previous 
call-edge, and vice versa. An automaton that recognizes the hot-path p requires the ability 
to skip over SLVPs in the input string, which requires a PDA. However, we can treat the 
hot-path automata as DFAs for the following reasons: 

1. The automata in A have transitions that are labeled with summary-edges. A transition 
(q_i, c → r, q_j) that is labeled with a summary-edge c → r is considered to be an 
“oracle” transition that is capable of skipping over an SLVP in the input string. The 
oracle required to skip an SLVP is the supergraph-as-PDA. 

2. When we combine a hot-path automaton with the supergraph, an oracle transition 
(q_i, c → r, q_j) will be combined with the summary-edge c → r of the supergraph 
to create the vertices [c, q_i] and [r, q_j] and the summary-edge [c, q_i] → [r, q_j] in the 
express-lane supergraph. The justification for this is that the set of SLVPs that an 
oracle transition (q_i, c → r, q_j) should skip over is precisely the set of SLVPs that 
drive the supergraph-as-PDA from c to r. 

Throughout the following sections, our examples use the program shown in Fig. 3. 



The Hot-Path Automata for Interprocedural Piecewise Paths. In this section, we 
show how to construct the set A of hot-path automata for recognizing hot interprocedural 
piecewise paths. We expand our definition of A to allow each automaton A_P ∈ A to 
transition to other automata in A; thus, it is more accurate to describe A as one large 
automaton with several sub-automata. 

As in m , we build a hot-path automaton for recognizing a set of hot paths by building 
a trie A of the paths and defining a failure function that maps a vertex of the trie and a 
supergraph edge to another vertex of the trie ||21. We then consider A to be a DFA whose 
transition function is given by the edges of the trie and the failure function. 

For each procedure P, we create a trie of the hot paths that start in P. Hot paths 
that can only be reached by following a backedge u → v are prefixed with the special 
symbol •_v before they are put in the trie. A transition that is labeled by •_v can match 
any backedge that targets v. Fig. 4 shows the path tries for the supergraph in Fig. 3 and 
the following paths: 

Entry main CL ^ b ^ d ^ Entry F G ^ I 

•f F ^ H ^ I 

^F E y G y I y J y Exitfoo ^ C y Exityriain 

Every hot-path prefix corresponds to a unique state in a path trie. If a hot-path prefix 
ends at a vertex v and drives an automaton to state q, we say that q represents v; the root 
of the path trie for procedure P is said to represent Entry_P. The fact that q represents 
vertex v is important, since for a vertex [v, q] in the express-lane supergraph, either [v, q] 
is not on an express-lane and q represents an entry vertex, or q represents v. 



Fig. 4. Path trie for an interprocedural piecewise path profile of the supergraph in Fig. 3. 
For i ∈ [4..15] and a backedge e in foo, h(q_i, e) = […]; for i ∈ [4..15] and a 
non-backedge e in foo, h(q_i, e) = […]. For i ∈ ([0..3] ∪ [16..17]) and an edge e in main, 
h(q_i, e) = q_0. 

Fig. 5. Path trie for an interprocedural context path profile of the supergraph in Fig. 3. 
For i ∈ ([0..3] ∪ [15..16]) and an edge e in main, h(q_i, e) = q_0. For i ∈ [4..14] and a 
backedge e in foo, h(q_i, e) = […]; for i ∈ [4..14] and a non-backedge e in foo, 
h(q_i, e) = […]. For q_17 and any edge e in foo, h(q_17, e) = q_17. 



As in [3], we define a failure function h(q, u → v) for a state q of any trie and an 
intraprocedural or summary-edge u → v; the failure function is not defined for 
interprocedural edges. If q represents a vertex w of procedure P and u → v is not a 
backedge, then h(q, u → v) = root_trie_P, where root_trie_P is the root of the trie for 
hot paths beginning in P. If u → v is a backedge, then h(q, u → v) = q_v, where q_v is 
the target state in the transition (root_trie_P, •_v, q_v); if there is no transition 
(root_trie_P, •_v, q_v), then q_v = root_trie_P. 

The later phases of the express-lane transformation make use of two functions, 
LastActiveCaller and LastEntry, which map trie states to trie states. For a state q that 
represents a vertex in procedure P, LastActiveCaller(q) maps to the most recent ancestor 
of q that represents a call vertex that makes a non-recursive call to P. LastEntry(q) 
maps to the most recent ancestor of q that represents Entry_P. LastActiveCaller(q) and 
LastEntry(q) are undefined for q if there is no appropriate ancestor of q in the trie. 
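
The trie and the piecewise failure function can be sketched as follows. This is a minimal Python illustration restricted to a single procedure; the `PathTrie` class, the `("•", v)` encoding of the special symbol, and the example paths are hypothetical (the vertex names are not those of Fig. 3).

```python
from typing import Dict, List, Optional, Tuple

Label = Tuple[str, str]  # an edge (u, v), or ("•", v) for the special symbol


class PathTrie:
    """Trie of hot paths for one procedure; state 0 is the root (the entry vertex)."""

    def __init__(self, entry_vertex: str) -> None:
        self.delta: Dict[Tuple[int, Label], int] = {}
        self.parent: Dict[int, Optional[int]] = {0: None}
        self.represents: Dict[int, str] = {0: entry_vertex}

    def _step(self, state: int, label: Label, vertex: str) -> int:
        key = (state, label)
        if key not in self.delta:
            new_state = len(self.parent)
            self.delta[key] = new_state
            self.parent[new_state] = state
            self.represents[new_state] = vertex
        return self.delta[key]

    def add_path(self, vertices: List[str], after_backedge: bool = False) -> int:
        """Insert a hot path; paths reachable only via a backedge get a • prefix."""
        state = 0
        if after_backedge:
            state = self._step(state, ("•", vertices[0]), vertices[0])
        for u, v in zip(vertices, vertices[1:]):
            state = self._step(state, (u, v), v)
        return state

    def failure(self, edge: Label, is_backedge: bool) -> int:
        """Piecewise failure function: root, or the state after the • transition."""
        if not is_backedge:
            return 0
        return self.delta.get((0, ("•", edge[1])), 0)

    def transition(self, state: int, edge: Label, is_backedge: bool = False) -> int:
        """DFA transition: follow a trie edge if one exists, else use the failure function."""
        target = self.delta.get((state, edge))
        return target if target is not None else self.failure(edge, is_backedge)


# Hypothetical hot paths in a procedure foo.
trie = PathTrie("Entry_foo")
end_of_first = trie.add_path(["Entry_foo", "E", "G", "I"])
end_of_second = trie.add_path(["F", "H", "I"], after_backedge=True)
```

Because the failure function makes the transition function total, the trie can be used directly as the DFA that the tracing phase combines with the flowgraph.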



The Hot-Path Automata for Interprocedural Context Paths. The principal difference 
with the previous section is in how the failure function is defined. As above, a path trie 
is created for each procedure. Before a path is put into a trie, each surrogate edge u → v 
is replaced by an edge labeled with •_v. As before, •_v matches any backedge that targets 
v. Fig. 5 shows the path tries for the supergraph in Fig. 3 and the paths: 

Entry main a ^ b ^ d ^ Entry E G ^ I 

Entry ma^n ^ Entry t E ^ H ^ I 

Entrymazn ^ a^b^ d^ Entry t Exitpo e 

y Exitrnain 

A state q that represents an entry vertex Entry_P corresponds to a hot-path prefix 
p that describes a calling context for procedure P. For this reason, states in the trie 
that represent entry vertices take on special importance in this section. Also, the map 
LastEntry will be important. 

The maps LastActiveCaller and LastEntry are defined as in the last section. The 
failure function is defined as follows: if u → v is not a backedge, then h(q, u → v) = 
LastEntry(q). If u → v is a backedge, then h(q, u → v) = q', where q' is the state 
reached by following the transition labeled •_v from LastEntry(q); if there is no such 
state, then q' = LastEntry(q). 
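
The context-path failure function can be sketched on a tiny hand-built trie. This is a minimal Python illustration under stated assumptions: the trie contents, the `proc_of` map, and the vertex names are hypothetical, and the sketch assumes that LastEntry(q) may return q itself when q already represents the entry vertex.

```python
from typing import Dict, Optional, Tuple

Label = Tuple[str, str]  # an edge (u, v), or ("•", v) for a replaced surrogate edge

# Hypothetical trie for the context path
#   Entry_main -> a -> Entry_foo -(surrogate)-> F -> H
represents: Dict[int, str] = {0: "Entry_main", 1: "a", 2: "Entry_foo", 3: "F", 4: "H"}
parent: Dict[int, Optional[int]] = {0: None, 1: 0, 2: 1, 3: 2, 4: 3}
delta: Dict[Tuple[int, Label], int] = {
    (0, ("Entry_main", "a")): 1,
    (1, ("a", "Entry_foo")): 2,   # call-edge into foo
    (2, ("•", "F")): 3,           # surrogate edge replaced by a •_F label
    (3, ("F", "H")): 4,
}
proc_of: Dict[str, str] = {
    "Entry_main": "main", "a": "main",
    "Entry_foo": "foo", "F": "foo", "H": "foo",
}


def last_entry(q: int) -> Optional[int]:
    """Most recent ancestor of q (assumed to include q) representing Entry_P."""
    entry_vertex = "Entry_" + proc_of[represents[q]]
    state: Optional[int] = q
    while state is not None:
        if represents[state] == entry_vertex:
            return state
        state = parent[state]
    return None


def failure(q: int, edge: Label, is_backedge: bool) -> int:
    """Context-path failure function h(q, u -> v)."""
    entry_state = last_entry(q)
    if entry_state is None:
        raise ValueError("LastEntry is undefined for this state")
    if not is_backedge:
        return entry_state
    # Follow the •_v transition from LastEntry(q) if it exists.
    return delta.get((entry_state, ("•", edge[1])), entry_state)
```

Falling back to LastEntry(q) instead of the trie root is what keeps the automaton inside the current calling context when execution leaves a hot path.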

We now give some intuition for how the Hot-path Tracing Algorithm interacts with 
an automaton for interprocedural context paths. For any context-prefix p that leads to a 
procedure P, the Interprocedural Hot-path Tracing Algorithm may have to clone parts of 
P. This is required to make sure that the Context Property is guaranteed for express-lanes 
that begin with p (see Example 1). To accomplish this, the Hot-path Tracing Algorithm 
may generate many vertices [x, q], where q is the automaton state in hot-path automaton 
A that corresponds to the context-prefix p: when the hot-path automaton A is in the state 
q and is scanning a path [u → v → w ...] in procedure P that is cold in the context 
described by p, the automaton will stay in state q. Thus, the Interprocedural Hot-path 
Tracing Algorithm generates the path [[u, q] → [v, q] → [w, q] ...]. Only when the 
tracing algorithm begins tracing a path that is hot in the context of p does the hot-path 
automaton move out of state q. 



Phase 2: Hot-Path Tracing of Intraprocedural Path Pieces. This section describes 
the hot-path tracing algorithm that combines the family A of hot-path automata with the 
supergraph. A state q is a reset state if h(q, u → v) = q for some non-backedge u → v. 
Reset states are important for several reasons: (1) a context-prefix p always drives a hot- 
path automaton to a reset-state; (2) for every vertex [v, q] in the express-lane supergraph 
that is not part of an express-lane (i.e., [v, q] is part of residual, cold code), q is a reset 
state; and (3) for a reset state q and an express-lane supergraph vertex [v, q], either v is an 
entry vertex represented by q, or [v, q] is a cold vertex. We use these facts to determine 
whether an express-lane supergraph vertex [v, q] is part of an express-lane. 
G* is the input supergraph. 
A is a family of hot-path automata, with one automaton for each procedure in G*. 
A_P ∈ A denotes the automaton for procedure P. 
T_P denotes the transition relation of A_P. 
T is the disjoint union of all T_P. 
root_trie_main is the start state of A_main. 
W is a worklist of express-lane supergraph vertices. 
H* = (V, E) is the express-lane supergraph. 

Main() 
    /* First, create all the vertices that might begin a hot-path */ 
 1: V = {Entry'_global, Exit'_global} 
 2: Foreach procedure P 
 3:   CreateVertex([Entry_P, root_trie_P])   /* See Figure 7 */ 
 4:   If there is a transition (root_trie_P, •_r, q') where r is a return-site vertex 
        /* For hot-paths that begin at return-sites, start the express-lane. */ 
 5:     CreateVertex([r, q']) 
 6: E = {Entry'_global → [Entry_main, root_trie_main]} 

 9: While W ≠ ∅ 
10:   [v, q] = Take(W)   /* select and remove an element from W */ 
11:   If v is a call vertex 
12:     ProcessCallVertex([v, q]) 
13:   Else If v is an exit vertex 
14:     ForeachEdge v → r in G* 
          /* v → r is a return-edge */ 
15:       If there is a transition (q, v → r, q') ∈ T 
16:         CreateVertex([r, q'])   /* See Figure 7 */ 
17:   Else 
18:     ForeachEdge v → v' in G* 
19:       Let q' be the unique state such that (q, v → v', q') ∈ T 
20:       CreateVertex([v', q']) 
21:       E = E ∪ {[v, q] → [v', q']} 
22: Foreach vertex [Exit_main, q] ∈ V 
23:   E = E ∪ {[Exit_main, q] → Exit'_global} 
End Main 



Fig. 6. Interprocedural Hot-Path Tracing Algorithm. 



Figs. 6 and 7 show the Interprocedural Hot-Path Tracing Algorithm. The bulk of the
work of the algorithm is done by lines 19-21 of Fig. 6; these process each express-lane
supergraph vertex [v, q] that is not a call or exit vertex. This part of the algorithm is
very similar to its intraprocedural counterpart [?]: given an express-lane supergraph vertex
[v, q], a supergraph edge v → v' (which represents the transition (v, v → v', v') in the
supergraph-as-PDA), and a transition (q, v → v', q') ∈ T, lines 19-21 "trace out" a new edge
[v, q] → [v', q'] in the express-lane supergraph. If necessary, a new vertex [v', q'] is added
to the express-lane supergraph and the worklist W.
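
The tracing step of lines 19-21 can be illustrated with a small Python sketch. It is only a simplified, intraprocedural-style approximation: call vertices, exit vertices, edge labels and Phase 3 are left out, and the names `trace_hot_paths`, `edges` and `delta` are hypothetical rather than taken from the paper.

```python
from collections import deque

def trace_hot_paths(edges, delta, start_vertex, start_state):
    """Worklist product of a graph and a deterministic automaton.

    edges: dict vertex -> list of successor vertices
    delta: dict (state, (v, v2)) -> next state, total over the graph edges
    Returns (V, E): V holds pairs (v, q), E holds pairs of such pairs.
    """
    start = (start_vertex, start_state)
    V = {start}
    E = set()
    W = deque([start])
    while W:
        v, q = W.popleft()                 # Take(W)
        for v2 in edges.get(v, []):
            q2 = delta[(q, (v, v2))]       # the unique q' for this edge
            target = (v2, q2)
            if target not in V:            # CreateVertex
                V.add(target)
                W.append(target)
            E.add(((v, q), target))        # trace out [v,q] -> [v',q']
    return V, E

# Toy example: the hot path is a -> b -> d; state 0 plays the role of a reset state.
edges = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["a"]}
hot = {(0, ("a", "b")): 1, (1, ("b", "d")): 2}
delta = {(q, (v, w)): hot.get((q, (v, w)), 0)
         for q in (0, 1, 2) for v, ws in edges.items() for w in ws}
V, E = trace_hot_paths(edges, delta, "a", 0)
```

In this toy run the vertex d is duplicated: one copy lies on the hot path and one copy is reached from the cold path through c, which is the effect the express-lane construction relies on.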

The Interprocedural Hot-Path Tracing Algorithm differs from its intraprocedural
counterpart in the processing of call and exit vertices. Fig. 7 shows the function
ProcessCallVertex that is used to process a call-vertex [c, q]. ProcessCallVertex has
two responsibilities: (1) it creates call-edges from [c, q]; and (2) it creates the return-site
vertices [r, q'] that could be connected to [c, q] by a summary-edge in Phase 3 of the
construction. If Phase 3 does not create the summary-edge [c, q] → [r, q'], then [r, q'] is
unnecessary and will be removed from the graph in Phase 3.

CreateVertex([v, q])
24:  If [v, q] ∉ V
25:      V = V ∪ {[v, q]}
26:      Put(W, [v, q])
End CreateVertex

ProcessCallVertex([c, q])    /* c is a call vertex */
27:  Let r be the return-site vertex associated with c
     /* Create call edges to all appropriate entry vertices */
28:  Foreach edge c → Entry_P
         /* c may have many callees if it is an indirect call-site */
29:      If (q, c → Entry_P, q') ∈ T
             /* There is a hot path continuing from c along the edge c → Entry_P */
30:          CreateVertex([Entry_P, q'])
31:          E = E ∪ {[c, q] → [Entry_P, q']}
32:          Label [c, q] → [Entry_P, q'] with "([c, q]"
33:      Else
             /* Hook up [c, q] to a cold copy of Entry_P */
34:          CreateVertex([Entry_P, root-trie_P])
35:          E = E ∪ {[c, q] → [Entry_P, root-trie_P]}
36:          Label the call-edge [c, q] → [Entry_P, root-trie_P] with "([c, q]"
     /* Create every return-site vertex [r, q'] that could be needed in Phase 3 */
37:  Let q' be the unique state such that (q, c → r, q') ∈ T
38:  CreateVertex([r, q'])
End ProcessCallVertex

Fig. 7. The procedures CreateVertex and ProcessCallVertex used in Fig. 6.

Phase 3: Connecting Intraprocedural Path Pieces. The third phase of the interprocedural
express-lane transformation is responsible for completing the express-lane supergraph H*.
It must add the appropriate summary-edges and return-edges. Formally, this
phase of the interprocedural express-lane transformation ensures the following:

For each call vertex [c, q]
    For each call-edge [c, q] → [Entry_P, q']
        For each exit vertex [Exit_P, q''] reachable from [Entry_P, q'] by an SLVP
            There must be a return-site vertex [r, q'''] such that
                1. There is a summary-edge [c, q] → [r, q''']
                2. There is a return-edge [Exit_P, q''] → [r, q''']

The algorithm for Phase 3 is given in [?], Section 7.3.4.
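
The condition that Phase 3 must establish can be phrased as a small checker. The Python sketch below is not the Phase 3 algorithm itself; it only tests the stated property on explicit edge sets. It assumes that same-level reachability from entry vertices to exit vertices has already been computed and is passed in as `reachable_exits`; all names are hypothetical.

```python
def phase3_violations(call_edges, reachable_exits, summary_edges, return_edges):
    """List the (call vertex, exit vertex) pairs that lack a suitable return-site vertex.

    call_edges:      set of (call_vertex, entry_vertex)
    reachable_exits: dict entry_vertex -> iterable of exit vertices (assumed given)
    summary_edges:   set of (call_vertex, return_site_vertex)
    return_edges:    set of (exit_vertex, return_site_vertex)
    """
    missing = []
    for call, entry in sorted(call_edges):
        for exit_v in reachable_exits.get(entry, ()):
            # Some return-site vertex must be the target of both edges.
            ok = any((exit_v, r) in return_edges
                     for (c, r) in summary_edges if c == call)
            if not ok:
                missing.append((call, exit_v))
    return missing

call_edges = {(("c", 3), ("Entry_foo", 4))}
reachable_exits = {("Entry_foo", 4): [("Exit_foo", 8)]}
summary_edges = {(("c", 3), ("r", 9))}
return_edges = {(("Exit_foo", 8), ("r", 9))}

complete = phase3_violations(call_edges, reachable_exits, summary_edges, return_edges)
broken = phase3_violations(call_edges, reachable_exits, summary_edges, set())
```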

4 Experimental Results 

This section is broken into two parts. Section 4.1 discusses the effects of the various
express-lane transformations on interprocedural range analysis. Section 4.2 presents
experimental results on using the express-lane transformation and range analysis to
perform program optimization.

Fig. 8. Express-lane supergraph for the supergraph in Fig. 3 and the hot-path automaton in Fig. 4.
Most of the graph is constructed during Phase 2 of the construction. The edges [d, 3] → [e, 16],
[d, 3] → [e, 16], [d, 3] → [e, 0], [d, 0] → [e, 0], and [Exit_foo, S] → [e, 0] are added during Phase 3.
Each shaded vertex [v,q] has a state q that is a reset state; except for [Entry and 
[Entry 4], these are cold vertices. 



4.1 Effects of the Express-Lane Transformation on Range Analysis. 

We have written a tool in SUIF 1.3.0.5 called the Interprocedural Path Weasel (IPW) that
performs the interprocedural express-lane transformation.* The program takes as input
a set of C source files for a program P and a path profile pp for P. IPW first identifies
the smallest subset pp' of pp that covers 99% of the SUIF instructions executed.** Next,
IPW performs the appropriate express-lane transformation on P, creating an express-lane
version of each path in pp'. Finally, IPW performs interprocedural range analysis
on the express-lane (super)graph.

* The tool is named after, and based on, Glenn Ammons's tool Path Weasel, which performs the
intraprocedural express-lane transformation [?].

** The value 99% was arrived at experimentally; duplicating more paths does not cause a greater
benefit for range analysis, but it does cause a significant increase in code growth [?].

The experiments with IPW were run on a 550 MHz Pentium III with 256 MB of RAM
running Solaris 2.7. IPW was compiled with GCC 2.95.3 -O3. Each test was run 3
times, and the run times averaged. Cols. 3-5 of Table 1 compare the code growth and
the increase in range-analysis time for the different express-lane transformations.

To evaluate the results of range analysis on a program P, we weighted each data-flow
fact in vertex v by the execution frequency of v. Columns 6-8 of Table 1 compare
the results of range analysis after the express-lane transformations have been performed.
Three comparisons are made: the percentage of instruction operands that have a constant
value; the percentage of instructions that have a constant result; and the percentage of
decided branches, that is, conditional branch instructions that are determined to have only
one possible outcome. In all cases, the interprocedural express-lane transformations do
better than the intraprocedural express-lane transformation.

Fig. 9. Express-lane supergraph for the hot-path automaton in Fig. 5 and the supergraph in Fig. 3.
Most of the graph is constructed during Phase 2. The edges [d, 3] → [e, 15], [d, 3] → [e, 0],
[d, 0] → [e, 0], and [Exit_foo, 17] → [e, 0] are added during Phase 3. Each shaded vertex [v, q]
has a state q that is a reset state; except for [Entry and [Entry 4], these are cold
vertices.
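
The profile-pruning step described in this section (choosing the smallest subset pp' that covers 99% of the executed instructions) might be sketched as follows. The profile format, a mapping from a path to the number of instructions executed along it, and the reading of "smallest" as "fewest paths" are assumptions made for illustration; the function name is hypothetical.

```python
def smallest_covering_subset(path_counts, coverage=0.99):
    """Pick the fewest paths whose instruction counts reach the coverage target.

    path_counts: dict path_id -> instructions executed along that path
    (an assumed profile format, not IPW's actual one).
    """
    total = sum(path_counts.values())
    chosen, covered = [], 0
    # Taking paths in decreasing weight minimizes how many are needed.
    for path, count in sorted(path_counts.items(), key=lambda kv: kv[1], reverse=True):
        if covered >= coverage * total:
            break
        chosen.append(path)
        covered += count
    return chosen

profile = {"p1": 9000, "p2": 950, "p3": 30, "p4": 20}
hot_paths = smallest_covering_subset(profile)
```

With this toy profile, the two heaviest paths already account for 99.5% of the instructions, so the remaining paths are left cold.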

The range analysis we use allows the upper bound of a range to be increased once before
it widens the upper bound to (MaxVal - 1). Lower bounds are treated similarly. Our
range analysis is similar to Wegman and Zadeck's conditional constant propagation [?]
in that (1) it simultaneously performs dead code analysis and (2) it uses conditional
branches to refine the data-flow facts.
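
The widening policy can be made concrete with a minimal Python sketch. It assumes 32-bit signed values for MaxVal and assumes that the lower bound is widened symmetrically to (MinVal + 1); the paper only says that lower bounds are treated similarly. The class and field names are hypothetical.

```python
MAX_VAL = 2**31 - 1    # assumed MaxVal (32-bit signed integers)
MIN_VAL = -2**31       # assumed MinVal

class Range:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
        self.hi_bumps = 0    # how many times the upper bound has grown
        self.lo_bumps = 0    # how many times the lower bound has shrunk

    def join(self, lo, hi):
        """Merge an incoming range; a bound is widened on its second change."""
        if hi > self.hi:
            self.hi_bumps += 1
            self.hi = hi if self.hi_bumps <= 1 else MAX_VAL - 1
        if lo < self.lo:
            self.lo_bumps += 1
            self.lo = lo if self.lo_bumps <= 1 else MIN_VAL + 1
        return self

r = Range(0, 0)
r.join(0, 1)            # first increase is accepted as-is
after_first = r.hi
r.join(0, 2)            # second increase triggers widening
after_second = r.hi
```

Allowing one ordinary increase before widening keeps small ranges such as [0, 1] precise while still guaranteeing that the analysis of a loop counter terminates.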

Table 1. Columns 3-5 show a comparison of the (compile-time) cost of performing various
express-lane transformations and the (compile-time) cost of performing interprocedural range
analysis after an express-lane transformation has been performed; times are measured in seconds.
Columns 6-8 show a comparison of the results of range analysis after various express-lane
transformations have been performed.

Benchmark     E-Lane             Transform.   # Vertices in   Range Prop.   % const.   % const.   % decided
              Transform.         Time (sec)   E-Lane Graph    Time (sec)    operands   results    branches

124.m88ksim   Inter., Context           9.8           24032         569.5       28.5       33.1        19.7
              Inter., Piecewise         4.9           15113         508.4       28.6       33.2        20.0
              Intra., Piecewise         3.0           14218         734.2       27.7       32.3        17.5
              None                        -           11455         300.8       25.9       31.1         0.8

129.compress  Inter., Context           1.4            2610          14.7       21.3       26.9         9.8
              Inter., Piecewise         0.3            1014           9.4       21.3       26.9         9.8
              Intra., Piecewise         0.2             696          10.2       21.3       26.2         2.2
              None                        -             522           5.2       20.8       25.8         0.0

130.li        Inter., Context          12.9           23125          99.1       24.1       27.3         4.0
              Inter., Piecewise         5.3           11319          73.2       24.1       27.3         3.9
              Intra., Piecewise         1.9            7940          35.7       23.6       26.8         2.2
              None                        -            7240          29.0       23.3       26.5         0.0

132.ijpeg     Inter., Context          13.0           18087         628.8       16.8       23.6         4.0
              Inter., Piecewise         8.5           13768         526.1       16.8       23.6         4.0
              Intra., Piecewise         7.1           12955         504.3       16.6       23.3         1.4
              None                        -           12192         488.2       15.9       22.7         0.0

134.perl      Inter., Context          10.3           33863         713.8       24.3       28.8         3.3
              Inter., Piecewise         9.0           30189         655.2       24.2       28.8         3.0
              Intra., Piecewise         6.7           29309         718.6       24.1       28.7         2.8
              None                        -           27988         573.9       23.0       28.5         1.3

4.2 Using the Express-Lane Transformation for Program Optimization

As mentioned in the introduction, it is possible to reduce the express-lane graph while
preserving "valuable" data-flow facts. We used three different reduction strategies:

1. Strategy 1 preserves data-flow facts that determine the outcome of a conditional
   branch. Strategy 1 is based on the Coarsest Partitioning Algorithm [?].

2. Strategy 2 preserves all data-flow facts. Strategy 2 is based on the Coarsest
   Partitioning Algorithm and the Edge Redirection Algorithm given in [?].

3. Strategy 3 is similar to Strategy 2, but only preserves data-flow facts that decide
   conditional branches (as in Strategy 1).
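
The idea behind partition-based reduction, merging duplicated vertices that cannot be told apart by the facts one wants to keep, can be illustrated with a naive partition refinement in Python. This is a generic textbook-style refinement loop, not the Coarsest Partitioning Algorithm or the Edge Redirection Algorithm used for the strategies above; the graph encoding and all names are assumptions.

```python
def coarsest_partition(succ, initial_class):
    """Naive partition refinement (quadratic, for illustration only).

    succ:          dict vertex -> tuple of successors, ordered by edge label
    initial_class: dict vertex -> hashable key that merged vertices must share
                   (e.g. the data-flow fact to be preserved)
    Returns dict vertex -> block id; vertices with equal ids can be merged.
    """
    block = dict(initial_class)
    while True:
        # A vertex's signature is its own block plus the blocks of its successors.
        sig = {v: (block[v], tuple(block[s] for s in succ[v])) for v in succ}
        ids = {}
        new_block = {v: ids.setdefault(sig[v], len(ids)) for v in succ}
        if len(set(new_block.values())) == len(set(block.values())):
            return new_block            # no block was split: fixed point
        block = new_block

succ = {"a": ("b1", "b2"), "b1": ("d1",), "b2": ("d2",), "d1": (), "d2": ()}

# The duplicated copies carry the same facts, so they collapse again.
same = coarsest_partition(succ, {"a": "A", "b1": "B", "b2": "B",
                                 "d1": "const", "d2": "const"})
# d1 and d2 differ in a fact worth keeping, which also keeps b1 and b2 apart.
diff = coarsest_partition(succ, {"a": "A", "b1": "B", "b2": "B",
                                 "d1": "const", "d2": "unknown"})
```

The choice of `initial_class` is where the strategies would differ: keying it on all data-flow facts preserves everything, while keying it only on branch-deciding facts allows more vertices to merge.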

[?] contains more details, and more discussion of the trade-offs between these strategies.
Fig. 10 compares the amount of reduction achieved by these strategies.

Tables 2 through 4 show the results of using various forms of the express-lane
transformation together with range analysis to optimize SPEC95Int benchmarks.
Specifically, we followed these steps:

1. Perform an express-lane transformation.
2. Perform interprocedural range analysis on the express-lane (super)graph.
3. Reduce the express-lane (super)graph.
4. Eliminate decided branches and replace constant expressions.
5. Emit C source code for the transformed program.
6. Compile the C source code using GCC 2.95.3 -O3.
7. Compare the runtime of the new program with the runtime of the original program.

For a base case, we performed range analysis without any express-lane transformation
(repeated as Col. 2 in Tables 2 through 4). We ran experiments with three different
express-lane transformations. For each of the transformations, we tried the three
reduction strategies listed above. We also ran experiments where we performed an express-lane
transformation, then used Strategy 1 to reduce the express-lane (super)graph and then
skipped Step 4 above. The reported run time is always the average of three runs.

[Figure: bar charts over 124.m88ksim, 129.compress, 130.li, 132.ijpeg, 134.perl and "Geometric";
one panel is titled "Reducing the Intraprocedural, Piecewise Express-Lane Graph"; legend:
Strategy 1, Strategy 2, Strategy 3.]

Fig. 10. Comparison of the strategies for reducing the express-lane supergraph.

The best results were for the intraprocedural express-lane transformation (Table 4).
The intraprocedural express-lane transformation together with the range analysis
optimizations has a benefit to performance even when no reduction strategy is used to limit
code growth. In fact, aggressive reduction strategies can destroy the performance gains.
There are several possible reasons for this:

1. GCC may be able to take advantage of the express-lane transformation to perform
   its own optimizations (e.g., code layout [?]).

2. Reduction of the hot path graphs may result in poorer code layout that requires more
   unconditional jumps along critical paths [?].

3. The more aggressive reduction strategies seek only to preserve decided branches,
   and may destroy data-flow facts that show an expression to have a constant value.

4. The code layout for the reduced graph may interact poorly with the I-cache.

The results shown in Tables 2 and 3 are often (but not always) negative. There are
two likely reasons for this:

1. It would have been difficult to modify an x86 code generator or a hardware simulator
   to support entry and exit splitting; instead, we used a straightforward implementation
   in software. This incurred overhead on each procedure entry and exit.

2. There is a significant increase in code growth.

Col. 4 of Tables 2 and 3 (and 4) shows the performance overhead incurred by the
transformations. Fig. 10 shows reasonable code growth for the interprocedural express-lane
transformation, and we assume that most of the performance degradation is due to entry
and exit splitting. Using the reduction strategies with the interprocedural express-lane
transformations usually helps performance. (Graph reduction may eliminate the need
for entry and exit splitting.) With aggressive reduction, the interprocedural piecewise
express-lane transformation usually leads to performance gains (see Col. 6 of Table 3).

It should also be noted that the interprocedural express-lane transformations combined
with the range-analysis optimizations do have a strong positive impact on program
performance, although it is usually not as great as the costs incurred by the
transformations. This can be seen in the experiments where we did not eliminate branches and
replace constants: cf. Columns 3 and 4 of Tables 2 and 3. (In those few cases where
performance showed a slight improvement, we assume there was a change in code layout
that had instruction cache effects.) This suggests that software and/or hardware support
for entry and exit splitting would be a profitable research direction.

Table 2. Program speedups due to the interprocedural, context express-lane transformation and
range propagation. For the base run times shown in Col. 2, the benchmarks were optimized by
removing decided branches and constant expressions (but without any express-lane transformation)
and then compiled using GCC 2.95.3 -O3.

                                          Reduction Strategy
Benchmark      Base run      None      Strategy 1   Strategy 1,   Strategy 2   Strategy 3
               time (sec)                           No Step 4

124.m88ksim      146.70     -34.7%        -9.3%       -29.5%       -13.1%       -11.4%
129.compress     135.46     -14.0%         1.0%        -4.3%         2.4%         2.0%
130.li           125.81     -57.2%       -20.4%       -27.8%       -30.4%       -25.4%
132.ijpeg        153.83      -7.5%        -1.6%        -1.2%        -4.5%        -4.8%
134.perl         109.04     -21.3%         4.9%         6.0%        -3.1%        -3.0%

Table 3. Program speedups due to the interproc., piecewise express-lane trans. and range prop.

                                          Reduction Strategy
Benchmark      Base run      None      Strategy 1   Strategy 1,   Strategy 2   Strategy 3
               time (sec)                           No Step 4

124.m88ksim      146.70     -13.6%        -0.7%       -11.4%         5.7%         5.4%
129.compress     135.46     -14.0%         0.5%        -4.5%        -0.2%         2.0%
130.li           125.81     -68.1%       -26.7%       -40.1%       -11.4%         2.5%
132.ijpeg        153.83      -2.3%        -2.2%        -0.8%        -2.2%        -4.2%
134.perl         109.04     -19.4%         2.8%         2.7%         6.1%         3.6%

Table 4. Program speedups due to the intraproc., piecewise express-lane trans. and range prop.

                                          Reduction Strategy
Benchmark      Base run      None      Strategy 1   Strategy 1,   Strategy 2   Strategy 3
               time (sec)                           No Step 4

124.m88ksim      146.70      10.6%        13.0%         1.2%        11.6%         7.4%
129.compress     135.46       6.4%         5.5%        -2.1%         2.1%         0.1%
130.li           125.81       8.1%        10.3%         7.2%        -1.7%        -0.6%
132.ijpeg        153.83       1.0%         0.7%        -0.1%        -1.6%        -2.0%
134.perl         109.04       9.7%        10.0%         6.3%         9.9%         5.4%

5 Related Work and Conclusions 

The work in this paper is an interprocedural extension of the work in [?]. This paper
and [?] are related to other work that focuses on improving the performance of particular
program paths. A partial list of such works includes [?]. A more detailed
discussion of related work can be found in [?]. As stated in the introduction, the
interprocedural express-lane transformation differs from other techniques in the literature on
one or more of the following points:

1. We duplicate interprocedural paths before performing analysis. 

2. We guide path duplication using interprocedural path profiles. 

3. We perform interprocedural range analysis on the transformed graph. 

4. We eliminate duplicated code when there was no benefit to range analysis. 

We have shown that the interprocedural express-lane transformations have a benefi- 
cial effect on interprocedural range analysis. The performance gains from the interpro- 
cedural express-lane transformation are slight or negative — but we have shown that it 
has potential. Specifically, we have shown that a greater percentage of dynamic branches 
can be decided statically, and that performance improvements are likely with a better 
hardware and/or software implementation of entry and exit splitting.

Abstract. One of the most common programming errors is the use of a
variable before its definition. This undefined value may produce incorrect 
results, memory violations, unpredictable behaviors and program failure. 
To detect this kind of error, two approaches can be used: compile-time 
analysis and run-time checking. However, compile-time analysis is far 
from perfect because of complicated data and control flows as well as 
arrays with non-linear, indirection subscripts, etc. On the other hand, 
dynamic checking, although supported by hardware and compiler tech- 
niques, is costly due to heavy code instrumentation while information 
available at compile-time is not taken into account. 

This paper presents a combination of an efficient compile-time analysis 
and a source code instrumentation for run-time checking. All kinds of 
variables are checked by PIPS, a Fortran research compiler for program 
analyses, transformation, parallelization and verification. Uninitialized 
array elements are detected by using imported array region, an efficient 
inter-procedural array data flow analysis. If exact array regions cannot be 
computed and compile-time information is not sufficient, array elements 
are initialized to a special value and their utilization is accompanied by 
a value test to assert the legality of the access. In comparison to the 
dynamic instrumentation, our method greatly reduces the number of 
variables to be initialized and to be checked. Code instrumentation is 
only needed for some array sections, not for the whole array. Tests are 
generated as early as possible. In addition, programs can be proved to 
be free from used-before-set errors statically at compile-time or, on the 
contrary, have real undefined errors. Experiments on SPEC95 CFP show
encouraging results on analysis cost and run-time overheads.



1 Introduction 

Used-before-set refers to the error occurring when a program uses a variable
that has not been assigned a value. This uninitialized variable, once used in
a calculation, can be quickly propagated throughout the entire program and
anything may happen. The program may produce different results each time it
runs, or may crash for no apparent reason, or may behave unpredictably. This
is also a known problem for embedded software. Some programming languages
such as Java and C++ have built-in mechanisms that ensure memory is
initialized to default values, which makes programs work consistently but may
not give the intended results.

G. Hedin (Ed.): CC 2003, LNCS 2622, pp. 217-231, 2003.

To detect this kind of error, two approaches can be used: compile-time analysis
and run-time checking. However, compile-time analysis is far from perfect
because complicated data and control flows result in a very large imprecision.
Furthermore, the use of global variables and arrays with non-linear, indirection
subscripts, etc. sometimes makes static checking completely ineffective, leading
to many spurious warnings. In addition, some other program analyses such as
points-to analysis [1], alias analysis and array bound checking [2] are prerequisites
for the detection of uninitialized variable uses.

On the other hand, pure dynamic checking is costly due to heavy code instrumentation,
while information available at compile-time is not taken into account.
The slowdown between instrumented and uninstrumented codes has been measured
to be up to 130 times in [3]. Dynamic checking is also not widely supported:
as shown in a report comparing Fortran compilers [4], only some Lahey/Fujitsu
and Salford compilers offer run-time checking for all kinds of variables. Other
compilers such as APF version 7.5, G77 version 0.5.26, NAS version 2.2
and PGI version 3.2-4 do not have this option. Intel Fortran Compiler version 6.0
and NAGWare F95 version 4.1 only check local and formal scalar variables;
array and global variables are omitted. Code instrumentation degrades
execution performance, so it can only be used to create a test version of the
program, not a production version. In addition, run-time checking only validates
the code for a specific input.

With the growth of hardware performance (processor speed, memory bandwidth),
software systems have become more and more complicated in order to better solve
real application problems. Debugging several million lines of code becomes more
difficult and time-consuming. The execution time overhead of dynamic checking and
the large number of possibly-undefined-variable warnings issued by static checking
are both unwelcome. Efficient compile-time analysis to prove the safety of
programs, to detect program errors statically, or to reduce the number of
run-time checks is necessary. The question is: by using advanced program analyses,
i.e., interprocedural array analysis, can static analysis be an adequate answer to
the used-before-set problem for scientific codes? If not, can a combination of
static and dynamic analyses reduce the cost of uninitialized variable checking?
The goal of our research is to provide a more precise and efficient static analysis
to detect uninitialized variables and, if sufficient information is not available,
to add run-time checks that guarantee program correctness.

The paper is organized as follows. Section 2 presents some related work on
uninitialized variable checking. Section 3 describes the imported array regions
analysis. Our used-before-set verification is presented in Section 4. Experimental
results obtained with the SPEC95 CFP benchmark are given in Section 5.
Conclusions are drawn in the last section.

2 Related Work 

To cope with the used-before-set problem, some compilers silently initialize
variables to a predefined value such as zero, so that programs work consistently, but
give incorrect results. Other compilers provide a run-time check option to spot
uses of undefined values. This run-time error detection can be done by
initializing each variable with a special value, depending on the variable type. If this
value is encountered in a computation, a trap is activated. This technique was
pioneered by Watfor, a Fortran debugging environment for IBM mainframes in
the 1970s, and then used in Salford, SGI and Cray compilers.

For example, the option trap_uninitialized of SGI compilers forces all real
variables to be initialized with a NaN value (Not a Number, IEEE Standard
754 Floating Point Numbers), and when this value is involved in a floating-point
calculation, it causes a floating-point trap. This approach raises several problems.
Exception handler functions or compiler debugging options can be used to find
the location of the exception, but they are platform- and compiler-dependent.
Furthermore, the IEEE invalid exception can be trapped for other reasons, not
necessarily an uninitialized variable. In addition, when no floating-point
calculation is done, e.g., in the assignment X = Y, no used-before-set error is detected,
which makes it difficult to track the origin of an error detected later. Other kinds of
variables such as integer and logical are not checked for uninitialization. In
conclusion, the execution overhead of this technique is low but used-before-set
debugging is almost impossible.
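
The weakness of the NaN sentinel can be seen in a few lines of Python. This is only an illustration of how the sentinel propagates: Python's float arithmetic produces quiet NaNs and does not raise a trap the way the compiler option described above does, and the variable names are hypothetical.

```python
import math

UNINIT = float("nan")    # sentinel stored in every real variable at start-up

y = UNINIT               # Y is never assigned by the program
x = y                    # plain copy: no arithmetic, so nothing could trap here
z = x * 2.0 + 1.0        # first arithmetic use; the NaN propagates into Z

def first_nan(values):
    """Return the name of the first variable holding NaN, or None."""
    for name, value in values.items():
        if math.isnan(value):
            return name
    return None

culprit = first_nan({"w": 1.0, "z": z})
```

The error only becomes visible at Z, although its origin is the unassigned Y two statements earlier, which is why tracking the source of the problem is difficult.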

Other compilers such as Lahey/Fujitsu compilers, SUN dbx debugger, etc. use
a memory coloring algorithm to detect run-time errors. For example, Valgrind, an
open-source memory debugger (http://devel-home.kde.org/sewardj), tracks each
byte of memory in the original program with nine status bits, one of which tracks
the addressability of that byte, while the other eight track the validity of the byte.
As a result, it can detect the use of single uninitialized bits, and does not report
spurious errors on bit-field operations. The object code instrumentation method
is used in Purify, a commercial memory access checking tool. Each memory
access is intercepted and monitored. The advantage of this method is that it
does not require recompilation and it supports libraries. However, the method
is instruction set and operating system dependent. The memory usage is bigger
than other methods. Furthermore, the program semantics is lost. An average
slowdown of 5.5 times between the generated code and the initial code is reported
for Purify in [5].
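
The nine-status-bits scheme can be modeled in a few lines of Python. This is a deliberately simplified model, not a description of how Valgrind is implemented: it checks definedness on every load, and the class and method names are hypothetical.

```python
class ShadowMemory:
    """Per-byte shadow state: one addressability bit plus eight validity bits."""

    def __init__(self, size):
        self.data = bytearray(size)
        self.addressable = [False] * size
        self.valid = bytearray(size)     # bit i set => bit i of the byte is defined

    def allocate(self, start, length):
        for a in range(start, start + length):
            self.addressable[a] = True
            self.valid[a] = 0x00         # allocated but not yet written

    def store(self, addr, value, mask=0xFF):
        """Write the bits selected by mask and mark only those bits as defined."""
        if not self.addressable[addr]:
            raise MemoryError(f"invalid write at {addr}")
        self.data[addr] = (self.data[addr] & ~mask & 0xFF) | (value & mask)
        self.valid[addr] |= mask

    def load(self, addr, mask=0xFF):
        """Read the bits selected by mask; fail if any of them is undefined."""
        if not self.addressable[addr]:
            raise MemoryError(f"invalid read at {addr}")
        if (self.valid[addr] & mask) != mask:
            raise ValueError(f"use of uninitialized bits at {addr}")
        return self.data[addr] & mask

mem = ShadowMemory(4)
mem.allocate(0, 2)
mem.store(0, 0b0101, mask=0x0F)          # a 4-bit field is written
low_nibble = mem.load(0, mask=0x0F)      # reading the same field is fine
```

Because validity is tracked per bit, reading the 4-bit field succeeds even though the other half of the byte was never written, which mirrors the remark about bit-field operations.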

The plusFORT toolkit (http://www.polyhedron.com) instruments source code
with probe routines so that uninitialized data can be spotted at run-time using
any compiler and platform. There are functions to set variables to undefined,
and functions to verify whether a data item is defined or not. Variables of all types are
checked. The information about violations provided by plusFORT is
precise and useful for debugging: the name of the subprogram and the line where the
reference to an uninitialized variable occurred are reported in a log file. However,
the instrumentation is not very efficient because of the inserted code. Fig. 1 shows that
plusFORT can detect bugs which depend on external data, but the execution
time is greatly increased with such dynamic tests.
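
The probe-routine approach can be mimicked with a toy Python class. This is a sketch of the general idea only; the correspondence with the UD$ and QD$ calls of Fig. 1 is our reading of that figure, and the class, its methods and the log format are hypothetical, not the plusFORT API.

```python
class Probes:
    """Toy analogue of set-undefined / check-defined probe routines."""

    def __init__(self):
        self.defined = set()
        self.log = []

    def set_undefined(self, name):
        """Mark a variable as undefined (the role the UD$ calls appear to play)."""
        self.defined.discard(name)

    def assign(self, name):
        """A store to the variable makes it defined."""
        self.defined.add(name)

    def check(self, name, routine, line):
        """Log a violation if the variable is used while undefined (cf. QD$)."""
        if name not in self.defined:
            self.log.append((routine, line, name))

p = Probes()
p.set_undefined("IOS")
p.set_undefined("IVAL")        # IVAL is assumed not to be set by the caller
p.assign("IOS")                # e.g. set by the READ statement
p.check("IOS", "QKNUM", 4)
p.check("IVAL", "QKNUM", 6)
```

Every use of a variable costs an extra call and a lookup, which is where the run-time overhead of this kind of instrumentation comes from.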

To reduce the execution cost, illegal uses of an uninitialized variable can be
detected at compile-time by some compilers and static analyzers. LCLint [?] is
an advanced C static checker that uses formal specifications written in the LCL
language to detect instances where the value of a location may be used before it
is defined. Although few spurious warnings are generated, there are cases where
LCLint cannot determine if a use-before-definition error is present, so a message
may be issued for a non-existing problem. In other cases, a real problem may go
undetected because of some simplifying assumptions.

      SUBROUTINE QKNUM(IVAL, POS)
      INTEGER IVAL, POS, VALS(50), IOS
      CALL SB$ENT('QKNUM', 'DYNBEF.FOR')
      CALL UD$I4(IOS)
      CALL UD$AI4(50, VALS)
      READ (11, *, IOSTAT = IOS) VALS
      CALL QD$I4('IOS', IOS, 4)
      IF (IOS .EQ. 0) THEN
        DO POS = 1,50
          CALL QD$I4('VALS(POS)', VALS(POS), 6)
          CALL QD$I4('IVAL', IVAL, 6)
          IF (IVAL .EQ. VALS(POS)) GOTO 10
        ENDDO
      ENDIF
      POS = 0
   10 CALL SB$EXI
      END

Fig. 1. Example with plusFORT probe routines

The static analyzer ftnchek (http://www.dsm.fordham.edu/ftnchek) gives
efficient warnings about possible uninitialized variables, but the analysis is not
complete. Warnings about common variables are only given for cases in which a
variable is used in some routine but not set in any other routine. It also has the
same problems as LCLint with non-existing or undetected errors, because of,
for example, the simplified rule about equivalenced arrays.

Reps et al. [?] consider the possibly uninitialized variable problem as an IFDS
(interprocedural, finite, distributive, subset) problem. A precise interprocedural data
flow analysis via graph reachability is implemented with the Tabulation Algorithm
to report the uses of possibly uninitialized variables. They compare the
accuracy and time requirement of the Tabulation Algorithm with a naive algorithm
that considers all execution paths, not only interprocedurally realizable
ones. The number of possibly uninitialized variables detected by their algorithm
ranges from 9% to 99% of that detected by the naive one. However, this is only
an over-approximation that does not give an exact answer as to whether there are
really use-before-set errors in the program. The number of possibly undefined
variables is rather high: 543 variables for an 897-line program, 894 variables for a
1345-line program.
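
The flavor of a "possibly uninitialized" analysis can be conveyed with a small forward may-analysis in Python. This is a plain intraprocedural worklist sketch over an explicit control-flow graph, not the Tabulation Algorithm and not an IFDS solver; the graph encoding and all names are assumptions.

```python
def possibly_uninitialized_uses(cfg, defs, uses, entry, variables):
    """Forward may-analysis: uses that can be reached with the variable unset.

    cfg:  dict node -> list of successor nodes
    defs: dict node -> set of variables assigned at that node
    uses: dict node -> set of variables read at that node (before its own defs)
    Returns a set of (node, variable) pairs.
    """
    # unset_in[n] = variables that may still be uninitialized on entry to n
    unset_in = {n: set() for n in cfg}
    unset_in[entry] = set(variables)
    worklist = [entry]
    while worklist:
        n = worklist.pop()
        out = unset_in[n] - defs.get(n, set())
        for s in cfg[n]:
            if not out <= unset_in[s]:
                unset_in[s] |= out
                worklist.append(s)
    return {(n, v) for n in cfg for v in uses.get(n, set()) & unset_in[n]}

# x is assigned on only one branch of a diamond and then used at the join.
cfg = {1: [2, 3], 2: [4], 3: [4], 4: []}
warn = possibly_uninitialized_uses(cfg, {2: {"x"}}, {4: {"x"}}, 1, {"x"})
safe = possibly_uninitialized_uses(cfg, {2: {"x"}, 3: {"x"}}, {4: {"x"}}, 1, {"x"})
```

The first call reports a warning because one path reaches the use without an assignment; whether that path is ever executed is exactly what such an over-approximation cannot tell.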

PolySpace technologies (http://www.polyspace.com) apply abstract interpretation,
the theory of semantic language approximations, to detect automatically
read accesses to non-initialized data. This technique predicts run-time errors
efficiently, and information about possibly non-initialized variables can be useful for
the debugging process of C and Ada programs. However, the PolySpace group has
published no data, so we cannot compare our results with theirs.

Another related work of Feautrier [8] proposes to compute for each use of a
scalar or array cell its source function, the statement that is the source of the
value contained therein at a given instant of the program execution. Uninitialized
variable checking can be done by verifying in the source the presence of the sign
⊥, which indicates access to an undefined memory cell. Unfortunately, the input
language in this paper is restricted to assignment statements, FOR loops, affine
indices and loop limits.

The main difficulties encountered by static analysis for used-before-set
verification are complicated data and control flows with different kinds of
variables. This explains why only a small set of variables such as scalar and local
variables is checked by some static analyzers and compilers. Our motivation is
to develop a more efficient program analysis for the used-before-set problem by
using imported array regions.

3 Imported Array Region Analysis 

Array region analyses collect information about the way array elements are used
and defined by programs. A convex array region, as defined in [?], is a set
of array elements described by a convex polyhedron [?]. Its constraints link
the region parameters that represent the array dimensions to the values of the
program integer scalar variables. A region has the approximation MUST if every
element in the region is accessed with certainty, MAY if its elements are simply
potentially accessed and EXACT if the region exactly represents the requested
set of array elements. There are two kinds of array regions, READ and WRITE
regions, that represent the effects of program statements on array elements.
For instance, A-WRITE-EXACT-{PHI1==1,PHI2==I} is the array region of statement
A(1,I)=5. The region parameters PHI1 and PHI2 respectively represent the first
and second dimensions of A.
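
A region such as A-WRITE-EXACT-{PHI1==1,PHI2==I} can be modeled, very roughly, as a record holding a membership predicate. The Python sketch below is only an illustration under that assumption; an actual implementation would keep the constraints symbolically as a polyhedron, and the class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

@dataclass(frozen=True)
class Region:
    array: str
    action: str      # "READ" or "WRITE"
    approx: str      # "EXACT", "MUST" or "MAY"
    constraint: Callable[[Tuple[int, ...], Dict[str, int]], bool]

    def contains(self, phi, env):
        """True if element phi satisfies the constraints for the variable values env."""
        return self.constraint(phi, env)

# A-WRITE-EXACT-{PHI1==1, PHI2==I}: the region of the statement A(1,I)=5
region = Region("A", "WRITE", "EXACT",
                lambda phi, env: phi[0] == 1 and phi[1] == env["I"])

hit = region.contains((1, 7), {"I": 7})
miss = region.contains((2, 7), {"I": 7})
```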

The order in which references to array elements are executed, array data 
flow information, is essential for program optimizations. IN array regions are 
introduced in [TT)in] to summarize the set of array elements whose values are 
imported (or locally upward exposed) by the current piece of code. One array 
element is imported by a fragment of code if there exists at least one use of 
the element whose value has not been defined earlier in the fragment itself. 
For instance, in the illustrative example in Fig. El the element B(J,K) in the 
second statement of the second J loop is read but its value is not imported 
by the loop body because it is previously defined by the first statement. On the 
contrary, the element B(J,K-1) is imported from the first J loop. The propagation 
of IN regions begins from the elementary statements to compound statements 
such as conditional statements, loops and sequences of statements, and through 
procedure calls. The input language of our analysis is Fortran. 
      K = FOO()
      DO I = 1,N
         DO J = 1,N
            B(J,K) = J + K
         ENDDO
         K = K + 1
         DO J = 1,N
            B(J,K) = J*J - K*K
            A(I) = A(I) + B(J,K) + B(J,K-1)
         ENDDO
      ENDDO



Fig. 2. Imported array region example 



Elementary Statement. The IN regions of an assignment are all read refer- 
ences of the statement. Each array reference on the right hand side is converted 
to an elementary region. Array references in the subscript expressions of the left 
hand side reference are also taken into account. These regions are EXACT if and 
only if the subscripts are affine functions of the program variables. To save space, 
regions of the same array are merged by using the union operator. 

The IN regions of an input/output statement are more complicated. The
input/output status, error and end-of-file specifiers are handled with respect to
the Fortran standard [13]. The order of variable occurrences in the input list is
used to compute the IN regions of an input statement. For example, in the input
statement READ *,N,(A(I),I=1,N), N is not imported since it is written before
being referenced in the implied-DO expression (A(I),I=1,N).

Conditional Statement. The IN regions of a conditional statement contain 
the READ regions of the test condition, plus the IN regions of the true branch 
if the test condition is evaluated true, or the IN regions of the false branch if 
the test condition is evaluated false. Since the test condition value is not always 
known at compile-time, the IN regions of the true and false branches, combined 
with the test condition, are unified in the over-approximated regions. 

Loop Statement. The IN regions of a loop contain array elements imported 
by each iteration but not previously written by the preceding iterations. Given 
the IN and WRITE regions of the loop body, the loop IN regions contain the 
imported array elements of the loop condition, plus the imported elements of the 
loop body if this condition is evaluated true. Then, when the loop is executed 
again, in the program state resulting from the execution of the loop body, they 
are added to the set of loop imported array elements in which all elements written 
by the previous execution are excluded. 

Sequence of Statements. Let $s$ be the sequence of instructions $s_1; s_2; \ldots; s_n$.
The IN regions of the sequence contain all elements imported by the first
statement $s_1$, plus the elements imported by $s_2; \ldots; s_n$ after the execution of $s_1$, but
not written by the latter.
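
The sequence rule can be sketched with explicit sets of array elements. This is a simplified illustration, not the polyhedral implementation: elements are written as tuples, scalar values are assumed fixed across the statements, and a DO loop with a known trip count is treated as the sequence of its unrolled iterations. The function and variable names are hypothetical.

```python
def sequence_in(statements):
    """IN set of a statement sequence.

    statements is a list of (reads, writes) pairs in execution order, where
    reads is the set of elements a statement imports and writes the set it
    defines. An element is imported by the sequence if some statement imports
    it and no earlier statement of the sequence has written it.
    """
    imported, written = set(), set()
    for reads, writes in statements:
        imported |= reads - written
        written |= writes
    return imported


def body(i, j, k):
    """Body of the second J loop of Fig. 2 for fixed I, J and K."""
    s1 = (set(), {("B", j, k)})                                  # B(J,K) = J*J - K*K
    s2 = ({("A", i), ("B", j, k), ("B", j, k - 1)}, {("A", i)})  # A(I) = A(I) + ...
    return [s1, s2]


I, K, N = 1, 5, 3
body_in = sequence_in(body(I, 2, K))
loop_in = sequence_in([stmt for j in range(1, N + 1) for stmt in body(I, j, K)])
print(sorted(body_in))
print(sorted(loop_in))
```

For one iteration only A(I) and B(J,K-1) are imported, since B(J,K) is written first. Over the whole loop, A(I) is imported once and B(j,K-1) for every j from 1 to N, which agrees with the regions the text reports for the second J loop.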

Control Flow Graph. Control flow graphs are handled in a very straight- 
forward fashion: the IN regions of the whole graph are equal to the union of the 
IN regions imported by all the nodes in the graph. Every variable modified at
a node is projected from the regions of all other nodes. All approximations are 
decreased to MAY. 

Interprocedural Array Region. The interprocedural propagation of IN 
regions is performed by a reverse invocation order traversal on the program 
call graph: a procedure is processed after its callees. For each procedure, the 
summary IN regions are computed by eliminating local effects from the IN 
regions of the procedure body. Information about formal parameters, global
and static variables is preserved. The resulting summary regions are stored in
the database and retrieved each time the procedure is invoked. At each call site, 
the summary IN regions of the called procedure are translated from the callee’s 
name space into the caller’s name space, using the relationships between actual 
and formal parameters, and between the declarations of global variables in both 
routines. 

Fig. 3 shows the IN regions computed for the running example. In the body of
the second J loop, array elements A(I), B(J,K) and B(J,K-1) are imported by the
second statement. Since B(J,K) is defined by the first statement, only A(I) and
B(J,K-1) are imported by the loop body. The IN regions of the second J loop are
<B(PHI1,PHI2)-IN-EXACT-{1<=PHI1,PHI1<=N,PHI2==K-1}> and <A(PHI1)-IN-EXACT-
{PHI1==I}>. After propagating these IN regions upward through statement K=K+1,
the region of array B becomes <B(PHI1,PHI2)-IN-EXACT-{1<=PHI1,PHI1<=N,PHI2==K}>.
Once again, all array elements in this region are defined by the first J loop, so
only array elements of A are imported by the code fragment in this example.

Array analysis is also studied in many papers [?]. The convex array regions
implemented in PIPS are based on the Regions method [?], where the IN region,
the set of imported array elements, is somewhat similar to the ExposedRead
region in [15], the UE set in [16], USE(s) in [17] and input effects in [?].
However, our IN regions differ from the others. For example, in [15], array element
sets are represented by lists of polyhedra and there is no exact representation,
only under- and over-approximations. The ExposedRead sets contain array
elements which are used in the continuation of the whole program before being
defined, while IN regions are restricted to the current level in the hierarchical
control flow graph. In [16], an array element set is a list of Regular Section
Descriptors with bounds and step, guarded by predicates derived from IF
conditions. When insufficient information is available, our MAY regions should be
more accurate because we can keep more information about PHI variables. Input
effects give, for each array element, its first use in the considered fragment
of code. This is similar to our IN regions, but the precise statement instance in
which the reference is performed is kept in the summary. The implementation
choice is really a trade-off between efficiency and precision. The IN regions
developed by [?] are used in this paper, with some improvements.
      K = FOO()
C     <A(PHI1)-IN-EXACT-{1<=PHI1, PHI1<=N}>
      DO I = 1,N
C        <A(PHI1)-IN-EXACT-{PHI1==I}>
         DO J = 1,N
            B(J,K) = J + K
         ENDDO
C        <B(PHI1,PHI2)-IN-EXACT-{1<=PHI1, PHI1<=N, PHI2==K}>
C        <A(PHI1)-IN-EXACT-{PHI1==I}>
         K = K + 1
C        <B(PHI1,PHI2)-IN-EXACT-{1<=PHI1, PHI1<=N, PHI2==K-1}>
C        <A(PHI1)-IN-EXACT-{PHI1==I}>
         DO J = 1,N
C           <B(PHI1,PHI2)-IN-EXACT-{PHI1==J, PHI2==K-1}>
C           <A(PHI1)-IN-EXACT-{PHI1==I}>
            B(J,K) = J*J - K*K
C           <B(PHI1,PHI2)-IN-EXACT-{PHI1==J, K-1<=PHI2, PHI2<=K}>
C           <A(PHI1)-IN-EXACT-{PHI1==I}>
            A(I) = A(I) + B(J,K) + B(J,K-1)
         ENDDO
      ENDDO



Fig. 3. Computed IN regions 



4 Used-Before-Set Analysis 

Our used-before-set analysis is directly based on IN array regions. These regions
are computed for arrays, but scalar variables also carry the same kind of
information, which is cheaper to compute. In fact, the region of a scalar has an
empty predicate, i.e. <V-IN-EXACT-{}>. Information about imported array elements
and scalar variables is propagated interprocedurally, from the elementary
statements to the compound statements. We traverse the program call graph in
the invocation order, in which a procedure is processed after all its callers.

procedure Used_Before_Set_Analysis(p)
   p : current procedure
begin
   s := entry statement of p
   l := list of variables having IN region at s
   for each v in l
      if local_variable(v,p) or global_variable_in_main_program(v,p) then
         if the IN region of v at s is MUST or EXACT then
            error: "Variable v is used before set"
         else /* MAY IN region */
            insert an initialization on v before s
            go_down_and_verify(v, s)
      else /* v is a formal parameter or a global variable in a called procedure */
         if must_be_checked(v,p) then
            go_down_and_verify(v, s)
end

procedure go_down_and_verify(v, s)
begin
   for each sub-statement s_i of s
      if the IN region of v at s_i is EXACT then
         insert a verification on v before s_i
      else /* MAY IN region */
         if call_statement(s_i) then
            mark must_be_checked for the corresponding
            formal or global variables in the called procedure
         else
            go_down_and_verify(v, s_i)
end

The list of IN regions at the module entry statement gives us the set of all 
possibly undefined variables of the module, and vice-versa, only variables in this 
list may be used before set. So at the entry statement, if the list of IN regions is 
empty, there is no used-before-set error in this module. Otherwise, each variable 
in the list is checked. In Fortran, a variable scope is always a module. The scope 
of a global variable declared in a module is that module but the scope of the 
common block where the variable is located is the whole program. So the IN 
regions of all global variables are propagated to the main program, although the 
variables are not declared in it. Depending on the variable type (local, formal or 
global) and the current module, we have two cases: 

Case 1: Local or global variable in the main program. Depending on the region 
approximation, we have two sub-cases: 

— If the IN region has the approximation MUST or EXACT, the variable must be 
used somewhere in the module before being defined and an error is detected. 

— Otherwise, the region is MAY; we instrument the code by inserting an ini- 
tialization function before the entry statement and go down to the sub- 
statements where the IN region is propagated from. Before each statement 
where we know with certainty that the variable must be imported, a verifi- 
cation function is inserted. We continue to go down for each statement with 
MAY regions. If this statement is a procedure call, information is added to 
mark that the corresponding formal parameters of the actual variable, or the 
corresponding global variables must be checked at the callee’s level. To help 
the debugging process, information about the call path is added to locate 
the run-time error. 

Case 2: Formal parameter or global variable in a called procedure. If this
variable is marked as must be checked, we repeat the process as for local variables,
but no initialization is needed since it has been performed earlier in one of the
callers.
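
The two cases can be traced on a toy statement tree. The sketch below is a hedged transcription of the pseudocode, not the PIPS implementation: the `Stmt` class, the string-valued actions, and the simplification that a formal parameter carries the same name as the actual variable are all assumptions made for illustration. Sub-statements that do not import the variable are skipped, which the pseudocode leaves implicit.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Stmt:
    """Hypothetical statement node annotated with IN-region approximations."""
    label: str
    in_regions: Dict[str, str] = field(default_factory=dict)  # var -> MUST/EXACT/MAY
    children: List["Stmt"] = field(default_factory=list)
    callee: Optional[str] = None  # set when the statement is a procedure call


def go_down_and_verify(v, s, actions, must_be_checked):
    for si in s.children:
        approx = si.in_regions.get(v)
        if approx is None:
            continue  # assumption: v is not imported by si, nothing to do
        if approx == "EXACT":
            actions.append(f"verify {v} before {si.label}")
        elif si.callee is not None:
            # simplification: the formal has the same name as the actual
            must_be_checked.add((si.callee, v))
        else:
            go_down_and_verify(v, si, actions, must_be_checked)


def used_before_set_analysis(proc, entry, local_or_main_global, must_be_checked):
    actions = []
    for v, approx in entry.in_regions.items():
        if v in local_or_main_global:
            if approx in ("MUST", "EXACT"):
                actions.append(f"error: variable {v} is used before set")
            else:
                actions.append(f"initialize {v} before {entry.label}")
                go_down_and_verify(v, entry, actions, must_be_checked)
        elif (proc, v) in must_be_checked:
            go_down_and_verify(v, entry, actions, must_be_checked)
    return actions


# A formal parameter IVAL that is MAY-imported at entry, MAY-imported by a
# conditional, and EXACT-imported by the loop inside the conditional.
loop = Stmt("DO POS", {"IVAL": "EXACT"})
cond = Stmt("IF (IOS .EQ. 0)", {"IVAL": "MAY"}, [loop])
read = Stmt("READ", {})
entry = Stmt("QKNUM body", {"IVAL": "MAY"}, [read, cond])
marked = {("QKNUM", "IVAL")}  # hypothetical: a caller marked IVAL as must be checked
actions = used_before_set_analysis("QKNUM", entry, set(), marked)
print(actions)
```

In this run the only action is a verification placed before the loop, inside the conditional. No initialization is emitted because IVAL is a formal parameter (Case 2).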

To trap premature usage, the initialization is implemented by assigning a 
special value to the maybe uninitialized scalar variable or array element. The 
verification checks whether the variable value is equal to this special value, and 
reports the error if one has occurred. If the variable is an array, the instrumented 
code is a DO loop associated with the corresponding IN region, whereas it is a simple
assignment and test for the scalar variables. All array elements in the MAY IN
region are marked uninitialized and all array elements in the EXACT IN region
are checked. We use the algorithm described in [21] to compute loop bounds from
the region predicate, which is a polyhedron. This algorithm scans the polyhedron
and uses the Fourier pairwise elimination to find loop bounds for each dimension.
The generated loops have the following form:



a. Marking initialization

      IF (condition) THEN
         DO PHI1 = LOWER1, UPPER1
            A(PHI1) = special_value
         ENDDO
      ENDIF

b. Initialization verification

      IF (condition) THEN
         DO PHI1 = LOWER1, UPPER1
            IF (A(PHI1) .EQ. special_value) STOP
         ENDDO
      ENDIF



However, there are some implementation problems related to the special value 
and the type of variable. We can use an SNaN (Signaling Not a Number) for floating-point
variables, but there is no obvious equivalent for integer and logical variables. Currently, we
choose the maximal integral value for integer variables and a value different from 0
and 1 for logical variables. This may raise false positive warnings when program
computations really involve these special values. This is also a problem for other 
compiler implementations. Some memory coloring techniques can be used to 
avoid this problem, but at the expense of memory usage. 
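
The marking and verification loops, and the false positive they can produce, can be illustrated as follows. This is a sketch under assumptions: `sys.maxsize` merely stands in for "the maximal integral value", the function names are hypothetical, and the verification returns the offending indices (closer to the log-file option) instead of stopping the program.

```python
import sys

SPECIAL_INT = sys.maxsize  # stand-in for the maximal integral value


def mark_uninitialized(a, lower, upper, special=SPECIAL_INT):
    """Marking initialization: A(PHI1) = special_value for PHI1 in LOWER1..UPPER1."""
    for phi1 in range(lower, upper + 1):
        a[phi1 - 1] = special  # Fortran index PHI1 maps to Python index PHI1-1


def find_uninitialized(a, lower, upper, special=SPECIAL_INT):
    """Initialization verification: indices that still hold the special value."""
    return [phi1 for phi1 in range(lower, upper + 1) if a[phi1 - 1] == special]


vals = [0] * 5
mark_uninitialized(vals, 1, 5)
vals[0], vals[1], vals[2] = 7, 8, 9
vals[3] = sys.maxsize  # a legitimate result that happens to equal the special value
flagged = find_uninitialized(vals, 1, 5)
print(flagged)
```

Element 5 is reported because it was never assigned, and element 4 is reported although it was assigned. The second report is the kind of false positive described above.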

The efficiency of our analysis depends on the accuracy of array region anal- 
yses. The more precise the imported array regions are, the smaller the number 
of variables to be checked is, and code instrumentation is only used when we 
do not have enough information. Only array elements in the MAY IN regions are 
initialized and checked. Initialization and verification statements are inserted in 
the source code and the program is then compiled and executed normally. The
executable code appears to the user to operate as the original, but if a used-
before-set error is detected, the program is stopped with a message to indicate
the name of the variable, the module and the call path where this error occurred.
Another implemented option is that when a use before definition takes place, we
do not stop the program but write details to a log file for later analysis in order
to catch several bugs in one run.

To illustrate the used-before-set analysis, we use the example of plusFORT.
Fig. 4 shows the IN regions computed for module QKNUM. At the module entry,
there is one IN region for variable IVAL, which means that only this formal
variable may be used without initialization. The variable POS is not imported by
the loop because it is defined as the loop index. Neither is IOS imported by the
module because its value has been defined by the READ statement, before being
used in the test condition.

Array regions of the input statement are computed by taking into account
the Fortran standard [13]: if the input/output status is equal to zero, neither an
error condition nor an end-of-file condition is encountered by the processor, and all
data in the input/output list are transferred. If the input/output status is not
      SUBROUTINE QKNUM(IVAL, POS)
      INTEGER IVAL, POS, VALS(50), IOS
C     <IVAL-IN-MAY-{}>
      READ (11, *, IOSTAT = IOS) VALS
C     <IOS-IN-EXACT-{}>
C     <IVAL-IN-MAY-{}>
C     <VALS(PHI1)-IN-MAY-{1<=PHI1, PHI1<=50, IOS==0}>
      IF (IOS .EQ. 0) THEN
C        IF (IVAL.EQ.MAXINT) STOP "IVAL is undefined in module QKNUM ...
C        <IVAL-IN-EXACT-{}>
C        <VALS(PHI1)-IN-MAY-{1<=PHI1, PHI1<=50}>
         DO POS = 1,50
C           <IVAL-IN-EXACT-{}>
C           <POS-IN-EXACT-{}>
C           <VALS(PHI1)-IN-EXACT-{PHI1==POS}>
            IF (IVAL .EQ. VALS(POS)) GOTO 10
         ENDDO
      ENDIF
      POS = 0
 10   END



Fig. 4. Used-before-set analysis example



equal to zero and there is no error or end-of-file specifier, execution of the
executable program is terminated and no array element is defined. This is a language
implementation feature but it must be respected to detect the uninitialized errors
correctly. The set of array elements written by the input statement is exactly
<VALS(PHI1)-WRITE-EXACT-{1<=PHI1, PHI1<=50, IOS==0}>, and when propagating
the IN region of array VALS backward, we have: <VALS(PHI1)-IN-MAY-{1<=PHI1,
PHI1<=50, IOS==0}> - <VALS(PHI1)-WRITE-EXACT-{1<=PHI1, PHI1<=50, IOS==0}>
= <VALS(PHI1)-IN-EXACT-{}>. So all the elements of array VALS are well defined
before they are used. There is no used-before-set error for the local variables
in this module. By using static analysis, we prove that no instrumentation is
needed, which is a big advantage with respect to the code generated by plusFORT
(Fig. 1). Since we do not have any calling context, it is not possible to
conclude whether the variable IVAL is already initialized by the callers of QKNUM.
If the whole program were given and, following some call paths, IVAL might not
be initialized, a verification would be inserted before the loop and inside the
conditional statement, as shown in the fifth comment line in Fig. 4.



5 Experimental Results 

We used the SPEC95 CFP benchmarks [22] that contain all kinds of variables: 
scalar and array, local, formal and global, with complicated data and control 
flow graphs. The experiments consist of two steps: IN array region computation 
and used-before-set analysis. Table 1 summarizes relevant information for each
Table 1. SPEC95 CFP: number of lines, modules, scalar variables (total, maybe
uninitialized and percentage), array variables (total, maybe uninitialized and
percentage), compilation time (total and the used-before-set phase) and
execution slowdown.

                         Scalar              Array        Compilation
Bench    Line  Mod   Tot  May  Percent  Tot  May  Percent  Total   UBS  Slowdown
tomcatv   190    1    24    0    0.00%    9    5   55.56%   0:08  0:01      4.42
swim      429    6    42    0    0.00%   14   13   92.86%   0:15  0:01      4.19
su2cor   2332   35   276   12    4.35%  118   63   53.39%   5:35  0:02      5.43
hydro2d  4292   42   226   11    4.87%   34    7   20.59%   1:47  0:05      5.07
mgrid     484   12    49    0    0.00%   10    4   40.00%   1:50  0:14      3.77
applu    3868   16   200    1    0.50%   33   10   30.30%  20:48  1:32      6.06
turb3d   2101   23   246   14    5.69%   32   31   96.88%   2:03  0:17      6.41
apsi     7361   96  1035  125   12.08%   19   11   57.89%  21:05  0:05      1.06
fpppp    2784   38   919  331   36.02%   40   26   65.00%   6:39  0:29     12.92
wave5    7764  105  1192   74    6.21%  162   33   20.37%  26:46  2:95       UBS



benchmark. We report the total numbers of scalar and array variables (Columns 
4 and 7), the numbers of maybe uninitialized variables detected by the static 
analysis (Columns 5 and 8) and the corresponding percentages (Columns 6 and 
9). On average, the percentages of maybe uninitialized variables to be checked at 
run-time are 3.1% for scalar variables and 37.96% for array variables. No used- 
before-set error is detected at compile-time, which is expected for benchmarks. 
All scalar variables in tomcatv, swim and mgrid are proved to be well initialized. 
One initialization and several verifications (single tests for scalar variables and 
loops for arrays) are added for each maybe uninitialized variable. 

Column 10 shows the total compilation time (in minutes and seconds) re- 
quired by PIPS to parse, compute imported array regions, analyze and generate 
code with used-before-set checks. The used-before-set analysis phase only takes 
a very small fraction of this compilation time, which is shown in Column 11. 
These times are measured on an UltraSparc II 440 MHz with 256 MB of RAM. The code
instrumented with PIPS initializations and verifications is then compiled with 
the SUN Workshop F77 version 5.0 compiler to generate executable files. This 
experiment is reported with the optimizing options turned on, using the SPEC95 
CFP measurement guidelines (f77 -fast -xarch=v8plusa -fsimple=2 -xprefetch). 
Uninitialized variables are detected in wave5. In subroutine PARTBL, the local
and static variables LCMMAX and LCMR are used before initialization.

The execution time slowdown with the standard input data for the other SPEC95
CFP benchmarks is shown in the last column. We did not measure the slowdown 
when all references have to be instrumented, without help of static analysis, to 
show the contribution of the combined analysis, because we think that the slow- 
down given in this paper is more important in order to compare with the other 
work. On average, the instrumented code is 5.48 times slower than the initial 
code. There is only a 6% overhead for apsi. The overhead is rather high for 
fpppp (about 13 times) because of irreducible control flow graphs in this bench- 
mark. Information is lost and the approximation of array regions becomes MAY. 
It corresponds to 36.02% of checked scalar variables and 65% of checked array
variables. To improve the results, we can use a more sophisticated treatment
on irreducible control flow graphs when computing array regions. On the other 
hand, some program transformations are needed for several benchmarks. For 
instance, we can reduce the total number of variables to be checked from 18 to 
14 for hydro2d by cloning the subroutine ADLEN which has two totally different 
behaviors for two parameter values: "half" and "full" steps. This optimization
makes the array region analysis more precise, since interprocedural information 
helps to narrow down the scope of possible effects of the called procedure. 
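
The average slowdown quoted above can be rechecked from the last column of Table 1. The short computation below uses the nine benchmarks that have a numeric slowdown entry; wave5 is left out because its entry is not a number. The dictionary is simply a transcription of the table.

```python
# Slowdown column of Table 1 (wave5 has no numeric entry and is excluded).
slowdown = {
    "tomcatv": 4.42, "swim": 4.19, "su2cor": 5.43,
    "hydro2d": 5.07, "mgrid": 3.77, "applu": 6.06,
    "turb3d": 6.41, "apsi": 1.06, "fpppp": 12.92,
}

mean = sum(slowdown.values()) / len(slowdown)
print(f"{mean:.2f}")  # 5.48
```

The same column gives the two extremes discussed in the text: 1.06 for apsi, i.e. a 6% overhead, and 12.92 for fpppp.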

Other solutions are used in different compilers such as the undefined bit 
pattern or the NaN floating point trap, which can greatly reduce the execution
time overheads. The verification on real variables can be omitted and replaced 
by the NaN exception trap. However, the origin of the uninitialized errors is not 
easy to locate with this method. Since the primary objective of this work is to 
reduce the number of possibly undefined variables by using static analysis, we do 
not intend to implement these techniques at the moment. Experimental results 
show that the SPEC95 CFP benchmarks are in general well-debugged programs. Used-before-set
errors have been found in only one benchmark with its given input. 

6 Conclusion 

Static and dynamic analyses complement each other. Static analysis can automatically
discover run-time errors and reduce the instrumentation or debugging cost.
Dynamic checking takes into account program control flows and real input data
that sometimes make static checking completely ineffective. Our used-before-set
analysis combines these two approaches in order to reduce the overall cost while
assuring the correctness of the program.

PIPS is a source-to-source compiler that can be used for program analyses, 
transformation, parallelization and verification. By reusing advanced interproce- 
dural analyses, the verification task becomes more efficient. READ and WRITE re- 
gions are already used to analyze program dependencies. IN regions are exploited 
for program optimizations such as array privatization, compile-time optimiza- 
tion of local memory or cache behavior in hierarchical memory machines, etc. 
Their precision is improved to target the verification. Only about 600 additional 
lines of C code are needed to implement the used-before-set analysis phase. 

By using the IN region analysis, the static phase can improve one’s confidence
in the program correctness by showing that the program is free from used-before-
set errors. Alternatively, an error can be detected statically and the bug can be fixed right
after the analysis. The small number of maybe uninitialized variables pointed out by
our compile-time analysis can help the testing and validation process to save
debugging time. Run-time checks are generated only when information is not
available to monitor the verification process. When executing the code, if a used-
before-set error happens, the error message provides information about what
occurred prior to the error, which can be of great help when trying to identify
the fragment of code that actually caused the error. Furthermore, we could
have an appropriate exception handling mechanism, which is very important for
safety-critical systems.

Experimental results also show that poor code quality can make static analysis
insufficient; run-time checks remain and run-time failures cannot be eliminated.
In addition to the SPEC95 CFP suite, our analysis is applied to some large
scale industrial applications to enhance the debugging process. We have encountered
other problems with used-before-set checking, e.g. type mismatch: the
actual variable is of integer type and the corresponding formal variable is
declared of real type. In the called procedure, we verify whether the formal variable is
initialized by a NaN value check, which is in fact a check on an integer variable, and
this may give false results. In addition, the compiler and platform dependences
of initialization and verification functions are also implementation problems.

To obtain better results with static analysis, we are planning to improve
the accuracy of IN array regions on an arbitrary control flow graph by using a
more precise analysis, based on the control flow graph restructuring algorithm
of Bourdoncle [23]. In addition, other approaches of array analysis such as input
effects of Leservot [?] can give the exact statement where the used-before-set
error occurs. Or, as in [?], the list of complementary array sections can be kept
when performing some convex operators on array regions, in order to have more
precise analysis on array usage. The source function [8] can also be applied to
the used-before-set checking problem. The question is to study the trade-off
between precision and summarization, as well as the complexity in space and time.
Our method can be applied to other programming languages, with appropriate
language construct handling. For example, the procedure call recursion can be
handled with fixed point analysis on the call graph when computing imported
array regions. Other problems such as pointer analysis must be studied to
improve the precision of the static analysis in order to have an effective combined
approach. The PIPS software and documentation as well as the used before set 
checking are available on http://www.cri.ensmp.fr/ pips. 
Abstract. Aycock and Horspool have given an algorithm which improves
the efficiency of GLR parsers. A grammar is ‘reduced’ so that
there is no recursion apart from non-hidden left recursion, an FA recogniser
is then constructed and a stack is used when the recursive parts
of the original grammar are required. Aycock and Horspool then give 
an algorithm which performs all possible traversals of the resulting PDA 
on a given input string. This mirrors the approach taken by Tomita to 
perform all traversals of an LR(0) FA. However, Aycock and Horspool’s 
algorithm does not terminate in the case where the grammar contains 
hidden left recursion. In this paper we give a different method for con- 
structing an FA which recognises the language generated by the grammar 
provided that the only recursion in the grammar arises from left or right 
recursion. Using this FA allows us to reduce the number of places that 
the stack is required. We also give a different algorithm for constructing 
all traversals of the final PDA which is correct in all cases, including 
grammars with hidden left recursion. Thus we can apply our algorithm 
to all context free grammars. 



1 Introduction 

It is well known that a language is regular if and only if it is defined by a one-way
deterministic finite automaton (FA) (see for instance [1], pp 118-120) and that
the context free languages are similarly defined by the one-way nondeterministic
pushdown automata (PDA). Intuitively, the context free languages include those
with properly nested bracket structures. A deterministic FA is unable to guarantee
that brackets are paired correctly (the slogan has it that ‘regular expressions
can’t count’) but the addition of a stack enables correct nesting to be tested.

The Chomsky hierarchy shows that a regular language may be described
by a Context Free Grammar (CFG) and, although it is in general undecidable
whether a particular CFG generates a regular language, it is useful to think
informally of CFGs as having some productions which are regular and some
which are necessarily context free. Left and right recursion
($A \stackrel{+}{\Rightarrow} A\beta$, $A \stackrel{+}{\Rightarrow} \alpha A$)
produce iterated constructs in the generated language which may be described
by regular productions. The context free parts of the language arise from those
productions which contain embedded recursion such as
$A \stackrel{+}{\Rightarrow} \gamma A \delta$ where $\gamma$ and $\delta$
are markers for left and right hand bracketing constructs. In detail, of course,
embedded recursion may not be immediately obvious since the recursion may be
indirect. To further complicate matters, embedded recursion may be intertwined
with left or right recursion.

The usefulness of separating out iterative language constructs from neces- 
sarily recursive ones is clearly demonstrated by the widespread use of extended 
context free grammars [2] which directly support the use of regular expressions 
in rules. In recursive descent parsers it is usual to parse these regular expressions 
using iteration. 

It is reasonable to ask if we can algorithmically discover the context free 
core of a grammar, automatically replacing left and right recursion with regular 
expressions, providing an optimal BNF to EBNF conversion. The motivation 
here is to reduce stack activity during the parse and thus speed up the parser. 
Although such a technique would be of general application (allowing recursive 
descent parsers to replace recursive function calls with iteration, for instance) 
it is particularly interesting to analyse the behaviour of LR shift-reduce parsers 
from this vantage point since the LR languages correspond to those defined 
by deterministic PDAs. Regular languages, in the form of the deterministic
handle-finding automaton, are at the heart of the standard LR parsing scheme. A
stack is used to handle self-embedded recursion and right recursion. Interestingly,
left recursion is absorbed into the handle-finding automaton. In the case of
general (non-deterministic) CFGs, multiple stacks may be needed to keep track
of multiple putative derivations.

Tomita [3] gave an algorithm for recognising any context free language by
maintaining a compact representation of these multiple stacks in a ‘graph
structured stack’, allowing all traversals of an LR FA to be performed together. (Note,
Tomita’s basic algorithm does not work for all context free grammars since the
grammars must have had their cycles and, in some cases, their ε-productions
removed. Tomita modified his algorithm to handle ε-rules but this algorithm still
fails on hidden right recursion. An inelegant fix was subsequently provided by
Farshi [3]. We have described elsewhere a modification of Tomita’s basic
algorithm which works for all context free grammars [?] and which is more efficient than
the Tomita and Farshi variants.)

Aycock and Horspool [?] have given a parsing method in which a grammar
is ‘reduced’ so that there is no recursion apart from non-hidden left recursion. 
An FA recogniser for the reduced grammar is then constructed, and a stack 
is used when the recursive parts of the original grammar are required. Aycock 
and Horspool also give an algorithm which performs all possible traversals of 
the resulting PDA on a given input string. This mirrors the approach taken by 
Tomita to perform all traversals of an LR FA. 

Aycock and Horspool’s algorithm does not terminate in the case where the 
grammar contains ‘hidden’ left recursion. This is also the case for Tomita’s algo- 
rithm but the reasons for the failure to terminate in the two cases are different. 
Tomita’s algorithm really failed on hidden right recursion but he introduced 
a modification to correct the problem. It was this modification which in turn 
failed with respect to hidden left recursion (for further discussion of this issue 
see [?]). In Aycock and Horspool’s case, if a grammar non-terminal, A say, has
the property that $A \stackrel{+}{\Rightarrow} \alpha A \beta$ where $\alpha \neq \epsilon$, then the automaton which recognises
the language generated by A calls itself recursively. If $\alpha \stackrel{*}{\Rightarrow} \epsilon$ (that is, the
grammar contains hidden left recursion) then the automaton will repeatedly call itself
without consuming any input, and hence will fail to terminate.

In this paper we describe the construction of a reduction incorporated
automaton from a grammar which we can prove (see [?]) correctly recognises the
language generated by the grammar provided that the only recursion in the 
grammar essentially arises from left or right recursion. Using this automaton 
rather than the one constructed by Aycock and Horspool allows us to reduce 
the number of places that the stack is required in the final algorithm. We give a 
new algorithm for constructing all traversals of the final PDA and we can prove 
that this algorithm is correct in all cases, including grammars with hidden left 
recursion. Thus we can apply our modified parsing method to all context free 
grammars. The theorems in this paper are stated without proof, but full formal 
proofs can be found in the technical report [S] on which this paper is based.

2 Initial Definitions 

A context free grammar consists of a set N of non-terminal symbols, a set T
of terminal symbols, an element $S' \in N$ called the start symbol, and a set of
grammar rules of the form $A ::= \alpha$ where $A \in N$ and $\alpha$ is a (possibly empty)
string of terminals and non-terminals. We assume that there is an augmented
start rule, $S' ::= S$, so that $S'$ does not appear on the right hand side of any
grammar rule.

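A derivation step is an element of the form $\gamma A \delta \Rightarrow \gamma \alpha \delta$ where $\gamma$ and $\delta$ are
strings of terminals and non-terminals and $A ::= \alpha$ is a grammar rule. A derivation
of $\tau$ from $\sigma$ is a sequence of derivation steps
$\sigma \Rightarrow \beta_1 \Rightarrow \beta_2 \Rightarrow \ldots \Rightarrow \beta_{n-1} \Rightarrow \tau$.
We write $\sigma \Rightarrow^{*} \tau$, and $\sigma \Rightarrow^{+} \tau$ if $n > 0$.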

A sentential form is any string $\alpha$ such that $S \Rightarrow^{*} \alpha$ and a sentence is a sentential
form which contains only elements of T. The set, $L(\Gamma)$, of sentences which can
be derived from the start symbol of a grammar $\Gamma$, is defined to be the language
generated by $\Gamma$.

A string $\alpha$ is nullable if $\alpha \Rightarrow^{*} \epsilon$ and null if $\alpha \Rightarrow^{*} u \in T^{*}$ implies that $u = \epsilon$. We
say that a grammar has left (or right) recursion if there is a non-terminal A
and a derivation $A \Rightarrow^{+} \alpha A \beta$ where $\alpha$ is nullable (or $\beta$ is nullable). We say that the
recursion is hidden if $\alpha \neq \epsilon$ (or $\beta \neq \epsilon$). A grammar has proper self embedding if
there is some non-terminal, A, and non-null strings $\alpha$, $\beta$ such that $A \Rightarrow^{*} \alpha A \beta$.
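
The definitions of nullable strings and of (hidden) left recursion can be checked mechanically. The following Python sketch is illustrative only and is not part of the construction described in this paper: the rule representation and the names `RULES`, `nullable_nonterminals` and `left_recursive` are assumptions made for the example. The rules are those of the grammar used in Example 2 below.

```python
# Each rule is (left hand side, list of right hand side symbols); [] is epsilon.
RULES = [
    ("S'", ["S"]),
    ("S", ["B", "S", "A"]), ("S", []),
    ("B", ["D", "B", "b"]), ("B", []),
    ("D", ["a"]), ("D", []),
    ("A", []),
]


def nullable_nonterminals(rules):
    """Return the non-terminals that derive the empty string (fixed point)."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            if lhs not in nullable and all(s in nullable for s in rhs):
                nullable.add(lhs)
                changed = True
    return nullable


def left_recursive(rules):
    """Map each left recursive non-terminal to True if the recursion can be
    hidden (some nullable, non-empty prefix precedes the recursive symbol)."""
    nullable = nullable_nonterminals(rules)
    nonterminals = {lhs for lhs, _ in rules}
    # (A, B, hidden): rule A ::= alpha B beta with alpha nullable;
    # hidden records whether alpha is non-empty.
    edges = set()
    for lhs, rhs in rules:
        for i, sym in enumerate(rhs):
            if sym in nonterminals:
                edges.add((lhs, sym, i > 0))
            if sym not in nullable:
                break
    result = {}
    for start in nonterminals:
        seen, todo, found = set(), [(start, False)], set()
        while todo:
            node, hidden = todo.pop()
            for a, b, h in edges:
                if a != node:
                    continue
                state = (b, hidden or h)
                if b == start:
                    found.add(state[1])
                if state not in seen:
                    seen.add(state)
                    todo.append(state)
        if found:
            result[start] = True in found
    return result
```

Under the definition given above, this sketch flags both S (via S ::= BSA) and B (via B ::= DBb, where D is nullable) as having hidden left recursion.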

A finite automaton (FA) consists of a set of states and a set of transitions 
between these states. One of the states is singled out to be the start state, and 
one or more states are designated as accepting states. The transitions are labelled 
with grammar symbols together with the empty string $\epsilon$. For technical reasons
we shall want to label some of the transitions with special versions of $\epsilon$ which
correspond to ‘performing a reduction by rule i’. We denote these as $\mathcal{R}_i$.

A path is a sequence $\theta_1 \ldots \theta_k$ of transitions in the FA such that the source
state of $\theta_{i+1}$ is the target state of $\theta_i$ for $1 \le i \le k-1$. A path through the FA is
a path $\theta_1 \ldots \theta_k$ such that the source state of $\theta_1$ is the start state and the target
state of $\theta_k$ is an accepting state. For a path $\theta$, we write $\bar{\theta}$ for the string of terminal
and non-terminal symbols obtained by taking the labels of the transitions in $\theta$
and removing the $\epsilon$ and $\mathcal{R}$ symbols. We say that a string $\mu$ of grammar symbols
is accepted by an FA, N, if there is a path $\theta$ through N such that $\bar{\theta} = \mu$.
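
As an illustration of this notion of acceptance, the following Python sketch searches for a path through an FA whose labels spell a given string once the $\epsilon$ and reduction labels have been erased. It is a minimal sketch under stated assumptions: the dictionary encoding, the use of `None` for $\epsilon$ and of tuples such as `("R", 1)` for reduction labels, and the automaton `TOY` are all hypothetical and do not come from the paper.

```python
def accepts(transitions, start, accepting, symbols):
    """transitions maps a state to a list of (label, target) pairs.
    Labels None (epsilon) and ("R", i) (reduction) are erased from paths."""

    def is_silent(label):
        return label is None or isinstance(label, tuple)

    def closure(states):
        # All states reachable using only erased labels.
        seen, todo = set(states), list(states)
        while todo:
            state = todo.pop()
            for label, target in transitions.get(state, []):
                if is_silent(label) and target not in seen:
                    seen.add(target)
                    todo.append(target)
        return seen

    current = closure({start})
    for symbol in symbols:
        moved = {target for state in current
                 for label, target in transitions.get(state, [])
                 if label == symbol}
        current = closure(moved)
    return bool(current & accepting)


# Hypothetical automaton: 0 -a-> 1 -R1-> 2 and 0 -epsilon-> 2; 2 is accepting.
TOY = {0: [("a", 1), (None, 2)], 1: [(("R", 1), 2)]}
```

With this toy automaton the empty string and the string a are accepted, while aa is not.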

3 Reduction Incorporated Automata 

Parsing involves comparing a sentential form with the rules of a grammar so as 
to detect derivation steps and thus derivations. It is natural to render a grammar 
as an FA in which the states correspond to slots in the grammar, that is positions 
between grammar symbols, and the edges to the matching of grammar symbols. 
The standard non-deterministic LR(0) automaton is based on this approach: a
slot is represented as an item, a rule of the form $X ::= \alpha \cdot \beta$, where $X ::= \alpha\beta$
is a grammar production rule. The automaton includes a state for each item,
$X ::= \alpha \cdot x\beta$ and $X ::= \alpha x \cdot \beta$ are connected via an edge labelled x, and $X ::= \alpha \cdot A\beta$
and $A ::= \cdot\gamma$ are connected via an edge labelled $\epsilon$. The accepting states of the
FA are those with no out-edges, corresponding to items of the form $A ::= \gamma\cdot$. The
final LR(0) automaton is then obtained by performing the subset construction.

Aycock and Horspool’s central idea is to add additional ‘reduction’ transi- 
tions from accepting states of the FA to the state which would be reached after 
the corresponding reduction had been performed. Precisely how these reductions 
should be introduced is slightly subtle and a full discussion is given in [^ . It turns 
out that we cannot simply use the LR(0) automaton. Aycock and Horspool give 
a method for constructing their FA which uses tries [E|. We use a different ap- 
proach based on our Reduction Incorporated Automaton (RIA). The following 
is an informal description; a formal definition is given in Section 3.1.

(i) Create an expanded LR(0) automaton by ‘multiplying out’: each occurrence
of a nonterminal on the RHS of a production causes the entire set of items
for that non-terminal to be added afresh. In the case of recursive rules, add
an $\epsilon$-edge back to the most recent instance of the target item on a path from
the start state to the current state.

(ii) Add Aycock and Horspool style ‘reduction’ transitions (labelled with $\mathcal{R}$)
from the leaves of the multiplied out FA (which would correspond to accepting
states in the LR(0) automaton) to the state corresponding to the consumption
of the accepted non-terminal. The resulting automaton is called
the first stage, or initial RIA (IRIA).

(iii) Remove transitions labelled with non-terminals, since the final RIA will only
be used to match strings of terminals.

(iv) Perform the subset construction, with $\mathcal{R}$-transitions treated as non-$\epsilon$ edges,
to remove some non-determinism.

Example 1 Given $\Gamma_1$, a right recursive grammar:

0. $S' ::= S$      1. $S ::= \epsilon$      2. $S ::= aA$
3. $A ::= bS$      4. A::=Dg      5. $D ::= b$

the IRIA resulting from construction steps (i) and (ii) is



In this case the difference between this and the standard LR(0) approach is that 
there are two states labelled S 

The back-edge from state $A ::= b \cdot S$ to $S ::= \cdot aA$ and the corresponding $\mathcal{R}_2$
transition from $S ::= aA\cdot$ to $A ::= bS\cdot$ arise from the recursive occurrence of S
in the rule $A ::= bS$. These recursion back-edges indicate the points at which
the ‘multiplying-out’ of the LR(0) automaton would yield an infinite automaton
unless a cycle is created.

After application of construction steps (iii) and (iv) we obtain RIA($\Gamma_1$):




The existence of the $\mathcal{R}$-transitions means that the automaton is still
non-deterministic. We could reduce the non-determinism further by assigning
‘lookahead’ sets to the $\mathcal{R}$ transitions. However, this would still not always resolve
all non-determinism, so we shall discuss the addition of lookahead symbols to
the final push down automaton rather than introduce them at this intermediate
stage.



3.1 Formal RIA Construction Algorithm 

The results given in this paper (and the corresponding proofs given in M) are 
based on the following formal RIA construction algorithm. 

Given an augmented grammar $\Gamma$ we construct an RIA as follows:

Step 1: Create the start node labelled $S' ::= \cdot S$.

Step 2: While there are nodes in the FA which are not marked as dealt with,
carry out the following:

1. Pick a node K labelled $X ::= \mu \cdot \gamma$ which is not marked as dealt with.

2. If $\gamma \neq \epsilon$ then let $\gamma = x\gamma'$ where $x \in N \cup T$, create a new node, M, labelled
$X ::= \mu x \cdot \gamma'$, and add an arrow labelled x from K to M. This arrow is defined
to be a primary edge.

3. If $x = Y$, where Y is a non-terminal, for each rule $Y ::= \delta$:

if there is a node L, labelled $Y ::= \cdot\delta$, and a path $\theta$ from L to K which
consists of only primary edges and primary $\epsilon$-edges ($\theta$ may be empty),
add an arrow labelled $\epsilon$ from K to L,

otherwise, create a new node with label $Y ::= \cdot\delta$ and add an arrow
labelled $\epsilon$ from K to this new node. This arrow is defined to be a primary
$\epsilon$-edge.

4. Mark K as dealt with.

Step 3: Remove all the ‘dealt with’ marks from all nodes and mark the node
labelled $S' ::= S\cdot$ as the accepting node.

Step 4: While there are nodes labelled $Y ::= \gamma\cdot$ that are not dealt with: pick a
node K labelled $X ::= x_1 \ldots x_n\cdot$ which is not marked as dealt with. Let $X ::= x_1 \ldots x_n$
be rule i. If $X \neq S'$ then find each node L labelled $Z ::= \delta \cdot X\rho$ such that there is
a path labelled $(\epsilon, x_1, \ldots, x_n)$ from L to K, then add an arrow labelled $\mathcal{R}_i$ from
K to the child of L labelled $Z ::= \delta X \cdot \rho$. Mark K as dealt with. (The new edge
is called a reduction edge).

Step 5: Remove all arrows labelled with nonterminals (this does not make any
node unreachable because there is a reduction arrow with the same target as
each removed arrow).

Step 6: Perform the subset construction with edges labelled $\mathcal{R}_i$ treated as non-$\epsilon$
edges.
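
Steps 1 and 2 of the construction can be sketched in Python as follows. This is an illustrative sketch only, not the authors' implementation: the encoding of items as (rule index, dot position) pairs, the `parent` array used to follow primary edges backwards, and the names `build_iria_skeleton` and `SMALL` are assumptions made for the example. Steps 3 to 6 (reduction edges, removal of non-terminal arrows and the subset construction) are not covered.

```python
def build_iria_skeleton(rules):
    """Steps 1 and 2 only. rules[0] is the augmented start rule and each rule
    is (lhs, tuple of rhs symbols). Returns (items, edges): items[n] is the
    item labelling node n, and each edge is (source, label, target,
    is_primary), where the label None stands for epsilon."""
    nonterminals = {lhs for lhs, _ in rules}
    items = [(0, 0)]     # node 0 is the start node, S' ::= . S
    parent = [None]      # predecessor of each node along its primary edge
    edges = []
    todo = [0]           # nodes not yet marked as dealt with

    def new_node(source, item, label):
        items.append(item)
        parent.append(source)
        node = len(items) - 1
        edges.append((source, label, node, True))
        todo.append(node)

    while todo:
        k = todo.pop()
        rule, dot = items[k]
        rhs = rules[rule][1]
        if dot == len(rhs):
            continue                      # nothing follows the dot
        x = rhs[dot]
        new_node(k, (rule, dot + 1), x)   # primary edge labelled x
        if x not in nonterminals:
            continue
        for r2, (lhs, _) in enumerate(rules):
            if lhs != x:
                continue
            # Look for a node labelled Y ::= . delta among K and its
            # ancestors along primary edges (the path may be empty).
            node = k
            while node is not None and items[node] != (r2, 0):
                node = parent[node]
            if node is not None:
                edges.append((k, None, node, False))   # recursion back-edge
            else:
                new_node(k, (r2, 0), None)             # primary epsilon-edge
    return items, edges


# Hypothetical right recursive grammar: S' ::= S, S ::= aS | epsilon.
SMALL = [("S'", ("S",)), ("S", ("a", "S")), ("S", ())]
```

For this small grammar the sketch creates seven nodes, two of which carry the item for the empty rule, and a single back-edge arising from the recursive occurrence of S.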

Theorem 1 Let $\Gamma$ be an augmented grammar and let RIA($\Gamma$) be the associated
automaton constructed as above. If $\alpha$ is a non-trivial sentential form of $\Gamma$ then
$\alpha$ is accepted by RIA($\Gamma$).

Furthermore, if $\Gamma$ does not contain any proper self embedding and $u \in T^{*}$ is
accepted by RIA($\Gamma$) then $S' \Rightarrow^{*} u$.

In the next section we will describe how to deal with grammars which do 
contain proper self embedding. 

4 Generalised Regular Parsing 

In this section we describe how to build a PDA which can be used to recognise 
sentences in a language generated by a given context free grammar. The method 
is an extension of the construction given by Aycock and Horspool [B]. 

As we mentioned in the introduction, the problem with Aycock and Horspool’s
method is that, like any recursion based method, if a process makes a
recursive call to itself without consuming any input then the method will
ultimately fail to terminate unless special terminating measures, such as limiting the
number of recursive calls, are introduced. An automaton constructed by Aycock
and Horspool’s method can make a recursive call to itself without consuming
any input if and only if the original grammar contains hidden left recursion.
(Non-hidden left recursion does not generate a recursive call because it is
absorbed into the appropriate state when the automaton is constructed. This is
exactly analogous to LR(0) FAs which, for the same reason, admit direct left
recursion but not hidden left recursion.) Elsewhere, [S], we have given an
extension of Tomita’s original algorithm which allows $\epsilon$-productions and can cope
with grammars which contain hidden left recursion because the algorithm checks
whether a path in the GSS already exists before adding it again. Using a similar
idea, we have modified the call structure used by Aycock and Horspool to allow
us to determine when a call is being repeated without any input having been
consumed, thus ensuring that our algorithm always terminates.

In the rest of this section we shall describe our generalised regular parsing
algorithm. We begin by modifying the grammar to remove most of the recursion
(which changes the language generated by the grammar). We then construct the
RIA for the modified grammar as described in the previous section, together
with RIAs for certain subgrammars. We describe how to construct, from these
automata, an automaton, called a recursion call automaton (RCA), for the
original grammar $\Gamma$ which, together with a stack, can be used to recognise sentences
of $\Gamma$. We then give our algorithm for computing the results of all possible
traversals of the RCA for a given input string.

4.1 Recursive Call Automata 

Given a grammar, $\Gamma$, if there is a non-terminal A and a derivation $A \Rightarrow^{*} \alpha A \beta$,
where $\alpha$ and $\beta$ are not null, then pick one such A and replace an instance of
A on the RHS of a rule with a special terminal of the form $A^{\perp}$ so that this
derivation is no longer possible. Repeat this process until the derived grammar,
$\Gamma_S$, has no proper self embedding, and construct RIA($\Gamma_S$).

For each special terminal $A^{\perp}$ in $\Gamma_S$ construct the grammar $\Gamma_A$ which has the
same rules as $\Gamma_S$ but with the addition of a new start rule $S_A ::= A$, and then
construct RIA($\Gamma_A$) in such a way that all the state labels are disjoint from the
state labels of any other automaton we have constructed during this process. We
link all these automata together as follows: for each transition anywhere in any
of the automata labelled $A^{\perp}$ let the source node be labelled h and the target
node be labelled k. Remove this transition from the automaton and add a new
transition from node h to the start node of the automaton RIA($\Gamma_A$), labelling the new
transition p(k). Label the accepting node of RIA($\Gamma_A$) with pop. The start and
accepting states of the RCA are the start and accepting states, respectively, of
RIA($\Gamma_S$). We shall refer to this new automaton as the recursion call automaton
(RCA) associated with $\Gamma$, RCA($\Gamma$).
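
The linking step can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: transitions are encoded as (source, label, target) triples, a special terminal $A^{\perp}$ is written as the tuple `("call", "A")`, and `start_of` is a hypothetical map from each such A to the start state of its automaton. The state numbers in the example are invented.

```python
def link_automata(transitions, start_of):
    """Replace every transition h --A^perp--> k by a push transition p(k)
    from h to the start state of the automaton built for A."""
    linked = []
    for source, label, target in transitions:
        if isinstance(label, tuple) and label[0] == "call":
            # The old target k becomes the argument of the push label.
            linked.append((source, ("push", target), start_of[label[1]]))
        else:
            linked.append((source, label, target))
    return linked


# Hypothetical fragment: 0 -a-> 1 -A^perp-> 2 -b-> 3, where the automaton
# for A starts at state 10.
FRAGMENT = [(0, "a", 1), (1, ("call", "A"), 2), (2, "b", 3)]
LINKED = link_automata(FRAGMENT, {"A": 10})
```

Here the transition from state 1 to state 2 becomes a push transition p(2) from state 1 to state 10; the matching pop label on the accepting node of the sub-automaton is not modelled in this fragment.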

We can reduce the non-determinism in RCA($\Gamma$) by adding lookahead symbols
to the transitions. All reduction and push transitions whose target has a
transition labelled $x \in T$ or a transition with x in its lookahead set have x in
their lookahead sets. The pop action has a lookahead set which is the union of all
the lookahead sets associated with all the transitions from states which can be
reached when this pop action is performed (i.e. the targets of the transitions
labelled $A^{\perp}$ in the reduction incorporated automaton). Also, actions whose target
is an accepting state have $\$$ in their lookahead sets.

Example 2 Consider the grammar $\Gamma$ which contains proper self embedding on
B and hidden left and right recursion on S. (Note the string A is null so the
derivation $S \Rightarrow BSA$ does not constitute proper self embedding.)

$S' ::= S$      $B ::= DBb \mid \epsilon$      $A ::= \epsilon$

$S ::= BSA \mid \epsilon$      $D ::= a \mid \epsilon$

We remove the proper self embedding by replacing the second instance of B with
a special terminal, $B^{\perp}$, resulting in the grammar $\Gamma_S$



0. $S' ::= S$      3. $B ::= D B^{\perp} b$      6. $D ::= \epsilon$
1. $S ::= BSA$      4. $B ::= \epsilon$      7. $A ::= \epsilon$
2. $S ::= \epsilon$      5. $D ::= a$







The first stage reduction incorporated FA, IRIA($\Gamma_S$) is





(Here $X = \{a, b\}$ and $Y = \{a, b, \$\}$.)

A traversal of an RCA starts in the start state with a base node on the call
stack. Then, if we have reached state h and the next input symbol is a, either
- move to state k along a transition labelled $(\mathcal{R}_i, Z)$ where $a \in Z$, or

- move to state k along a transition labelled $(p(l), Z)$ where $a \in Z$, and push
l on to the top of the call stack, or

- if h is labelled $(pop, Z)$ and $a \in Z$, pop a symbol, l, off the top of the call
stack and move to state l, or

- move to state k along a transition labelled a and read the next input symbol.

If we have reached an accepting state of RCA($\Gamma$) and all the input has been
consumed, report that the input has been accepted. Otherwise, if no further
transitions are possible, report that this traversal has not been successful.

Theorem 2 A string, u, of terminals is in the language generated by $\Gamma$ if and
only if there is a traversal of RCA($\Gamma$) which accepts the input u.



4.2 Traversing an RCA 

We now consider how to determine whether or not there is a traversal of RCA($\Gamma$)
on a given input string $u = a_1 \ldots a_n$. The basic idea is to traverse the RCA and
record the states we can reach on the input that we have seen so far. The process
proceeds in steps, one step for each input symbol and one for the last symbol,
$\$$. We start in the start state and construct the set of all states which can be
reached without consuming any input. We then start a new set which contains
all states which can be reached from a state in the first set along a transition
labelled $a_1$. We then add all the states which can be reached without consuming
any further input. If we encounter a transition labelled with a push action then
we need to move to the state which is the target of this transition and to record
the state we need to move back to, the argument of the push label, to be used
when we reach the corresponding pop action. Thus, instead of just states, we
maintain a set U of (state, node) pairs which can be reached from the currently
read input.

The possibility of nested calls and multiple alternatives where the RCA is
non-deterministic means that we need an efficient method of recording the return
states. Following Aycock and Horspool we create a call graph which is structured
in a similar way to Tomita’s graph structured stack and associate each state we
reach in a traversal with a node in the call graph. When a push transition is used
we find or create a node in the call graph which is labelled with the return state
and we record the corresponding node with the state. We begin all traversals by
creating a base node in the call graph, $q_0$, labelled $-1$ (which is not the label of
any RCA node). We need to record the set of call graph nodes constructed at
each step in order to check whether a node with a particular label has already
been constructed, because in this case the node is reused. For this we use a set
P.

We shall see that left and right recursion in $\Gamma_S$ cause loops of reductions in
RCA($\Gamma$). We ensure that the traversal construction process terminates in such
cases by only adding each pair $(k, q)$ to the set U once at any given step in
the process. It is also possible to have loops which consume no input if in the
grammar $\Gamma_S$ we have $A \Rightarrow^{*} \alpha A^{\perp} \beta$ where $\alpha \Rightarrow^{*} \epsilon$. In this case the loop involves a
push to the start state of $\Gamma_A$ and is the source of the problem with the algorithm
given in [6]. We deal with this problem using an idea similar to that used by
Farshi in his modification of Tomita’s algorithm: we introduce loops in the call
graph. Before giving the formal algorithm we illustrate our approach with two
examples.

Computing Traversals Using Example 2 

Recall the grammar from Example 2 above and consider traversing the RCA
with input string $ab\$$.

We begin in the start state with lookahead symbol a and single stack node $q_0$,
so $U = \{(0, q_0)\} = P$. From state 0 we can reach states 3 and 4 along reduction
transitions, so $U = \{(0, q_0), (3, q_0), (4, q_0)\}$. From state 3 we can return to state 3
along a reduction but as $(3, q_0)$ is already in U we do not add it again, to ensure
that this step terminates. State 4 has a push transition so we create a new call
graph node, $q_1$, labelled with the argument of this transition, 8, and we make $q_1$
a parent of the node, $q_0$, and we add $(12, q_1)$ to U and P.




With lookahead symbol a from state 12 we can reach state 14 along a reduction
transition, so we add $(14, q_1)$ to U. From state 14 there is a push transition
p(15) so we create a new call graph node $q_2$ labelled 15 and an edge from $q_2$ to
$q_1$, and add $(12, q_2)$ to U and P. From $(12, q_2)$ we add $(14, q_2)$ to U, then from
$(14, q_2)$ we traverse the push transition labelled p(15). There is already a call
graph node, $q_2$, labelled 15 so we reuse this node and create an edge from it to
$q_2$. Since $(12, q_2)$ is already in U we do not add it again.




This step of the construction is now complete and we have $U = \{(0, q_0), (3, q_0),
(4, q_0), (12, q_1), (14, q_1), (12, q_2), (14, q_2)\}$. We then read the input symbol, a, and
check each of the elements in U for transitions labelled a. The states reachable
in this way form the basis of the new set U for this step. Thus we have $U =
\{(2, q_0), (13, q_1), (13, q_2)\}$ and $P = \emptyset$. We then traverse the reductions from 2
and 13, adding $(4, q_0)$, $(14, q_1)$ and $(14, q_2)$ to U. States 4 and 14 have push
transitions, so we create new call graph nodes, $q_3$, $q_4$ labelled 8 and 15 (see the
diagram below), and add $(12, q_3)$ and $(12, q_4)$ to U and P. We then traverse the
transitions from state 12 and add $(14, q_3)$, $(14, q_4)$, $(11, q_3)$ and $(11, q_4)$ to U.
When we process $(14, q_3)$ and $(14, q_4)$ we see that $(12, q_4) \in P$ and has label
15, so we reuse this node and just add edges from $q_4$ to $q_3$ and $q_4$.

State 11 has an associated pop action thus, when we process $(11, q_3)$, for all nodes
which are children of $q_3$, in this case $q_0$, we pop the label, 8, of $q_3$ off the stack
and move to state 8 with the new stack, whose top node is $q_0$, i.e. we add $(8, q_0)$
to U. Similarly, for $(11, q_4)$ we add $(15, q_4)$, $(15, q_3)$, $(15, q_2)$ and $(15, q_1)$ to U.
This step is then complete with

$U = \{(2, q_0), (13, q_1), (13, q_2), (4, q_0), (14, q_1), (14, q_2), (12, q_3), (12, q_4), (14, q_3),
(14, q_4), (11, q_3), (11, q_4), (8, q_0), (15, q_4), (15, q_3), (15, q_2), (15, q_1)\}$.

Finally we read the b and set $U = \{(9, q_0), (16, q_4), (16, q_3), (16, q_2), (16, q_1)\}$.
With lookahead $\$$ there is a reduction from state 9 to 3, then to states 5, 6 and 10.
So $U = \{(9, q_0), (16, q_4), (16, q_3), (16, q_2), (16, q_1), (3, q_0), (5, q_0), (6, q_0), (10, q_0)\}$.
We have now completed the process, the lookahead symbol is $\$$ and there is an
element $(10, q_0) \in U$ whose state is an accepting state of the RCA and whose
call graph node is the base node. Thus we accept the input string, i.e. $ab \in L(\Gamma)$.

There is one further issue that we need to address before we give the formal
algorithm. When we perform a pop action on a node, q labelled k, in the
call graph we need to create elements in U for each of the children of q. If we
subsequently add a new child p to q then we need to ensure that $(k, p)$ is added
to U. Thus when we create a new edge from q to p we check to see if U contains
any elements which result in a pop action from q. If such elements exist then we
ensure that U contains $(k, p)$.



Example 3 

Consider the grammar $\Gamma$, $S' ::= S$   $S ::= SSSb \mid \epsilon$.

We remove the proper self embedding, resulting in the grammar $\Gamma_S$

0. $S' ::= S$      1. $S ::= S S^{\perp} S^{\perp} b$      2. $S ::= \epsilon$



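Then RIA($\Gamma_S$) and RCA($\Gamma$) are

[Figure: RIA($\Gamma_S$) and RCA($\Gamma$). Surviving labels include the items $S ::= S \cdot S^{\perp} S^{\perp} b$ and $S ::= S S^{\perp} S^{\perp} \cdot b$ and a push transition p(2) with lookahead set $\{b, \$\}$.]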







We traverse RCA($\Gamma$) with input string $b\$$. Starting with the element $(0, q_0)$ we
traverse the reduction and add the element $(1, q_0)$ to U. We then traverse the
push transition, creating a new call graph node $q_1$ labelled 2 and adding $(0, q_1)$
to U and to P. We then traverse the reduction from $(0, q_1)$ and add $(1, q_1)$ to
U. The pop action in state 1 causes $(2, q_0)$ to be added to U, and, as there is
already a call graph node labelled 2 which has been constructed at this step,
the push transition from $(1, q_1)$ results in the construction of an edge from $q_1$ to
itself (see the diagram below). This creates a new edge from $q_1$ down which the
pop action associated with $(1, q_1)$ must be applied, and thus the element $(2, q_1)$
is added to U. If we then traverse the push transition from $(2, q_0)$ we create a
call graph node, $q_2$ labelled 3 with child $q_0$, and add $(0, q_2)$ to U and P. When
we traverse the push transition from $(2, q_1)$ we find that there is already a call
graph node labelled 3 so we just add an edge from $q_2$ to $q_1$. We then traverse
the reduction from $(0, q_2)$, adding $(1, q_2)$ to U, and perform the pop action for
each of $q_2$’s children, so that

$U = \{(0, q_0), (1, q_0), (0, q_1), (1, q_1), (2, q_0), (2, q_1), (0, q_2), (1, q_2), (3, q_0), (3, q_1)\}$.

The push transition from $(1, q_2)$ causes an edge to be added from $q_1$ to $q_2$, and
then the pop action associated with $(1, q_1)$ is applied down this edge, adding
$(2, q_2)$ to U. Finally, the push transition from $(2, q_2)$ causes a new edge to be
added from $q_2$ to itself, and the pop action associated with $(1, q_2)$ then causes
$(3, q_2)$ to be added to U.




This completes the first step of the process. We then read the next input symbol,
b, and continue traversing the RCA. Ultimately, when this step is complete we
have

$U = \{(4, q_0), (4, q_1), (4, q_2), (1, q_0), (1, q_1), (1, q_2), (2, q_0), (2, q_1), (2, q_2), (3, q_0),
(3, q_1), (3, q_2), (0, q_3), (0, q_4), (1, q_3), (1, q_4), (2, q_3), (2, q_4), (3, q_3), (3, q_4)\}$.

Since the next input symbol is $\$$ and U contains $(1, q_0)$, where 1 is an accepting
state of RCA($\Gamma$), the string b is accepted.

4.3 A Formal Recognition Algorithm for RCA($\Gamma$)

We shall now give the algorithm which computes the results of all possible
traversals of an RCA for a given input. We shall assume that the RCA is given in the
form of a table, T, whose rows are indexed by the state numbers of the RCA,
with the start state by convention being numbered 0, and whose columns are
indexed by the terminal symbols of the grammar and the end-of-string symbol,
$\$$. The entries in the table are sets of actions. If there is a transition in RCA($\Gamma$)
from state h to state k labelled with the terminal a, then $T(h, a)$ contains the
action sk. If there is a transition in RCA($\Gamma$) from state h to state k labelled
$(\mathcal{R}_i, Z)$, $(p(l), Z)$ or $(pop, Z)$ then, for all $x \in Z$, $T(h, x)$ contains the action
$(\mathcal{R}_i, k)$, $(p(l), k)$ or pop respectively.

We begin with the RCA, input $a_1 \ldots a_n \$$, and a recursion call graph which
contains a single node labelled $-1$ which is not the label of any state in the RCA.
At the end of each step in the process we have a set U of RCA nodes which can
be reached using the portion of input consumed so far, together with the node
which is the top of the associated call stacks. In the algorithm we shall call this
set $U_i$ to facilitate the exposition. At the beginning of step i the set $U_i$ already
contains the nodes which can be reached from the previous set, $U_{i-1}$, via a shift
action on the input symbol, $a_i$, which has just been read.

The nodes of the recursion call graph are all labelled with state numbers
from the RCA, except for a unique base node which is labelled $-1$. For every
node in the call graph there will be a path from this node to the base node. In
practice the labels on the call graph nodes will all be states in the RCA which
appear as parameters to p() transitions.

Input: an RCA written as a table T, and a string $a_1 \ldots a_n$

$a_{n+1} = \$$, $U_0 = P_0 = \ldots = U_n = P_n = \emptyset$

create a base node, $q_0$, in the call graph

create a process node, $u_0$, in $U_0$ labelled $(0, q_0)$ and add $(0, q_0)$ to $P_0$
for i = 0 to n do {
  add all the elements of $U_i$ to A

  while $A \neq \emptyset$ {

    remove $u = (h, q)$ from A

    if $sk \in T(h, a_{i+1})$ { if there is no node labelled $(k, q)$ in $U_{i+1}$ {
        create a process node v labelled $(k, q)$
        add v to $U_{i+1}$ } }

    for each $(\mathcal{R}_j, k) \in T(h, a_{i+1})$ { if there is no $v \in U_i$ labelled $(k, q)$ {
        create a process node v labelled $(k, q)$
        add v to A and to $U_i$ } }

    if $pop \in T(h, a_{i+1})$ { let k be the label of q and Z be the successors of q
      for each $p \in Z$ {
        if there is no $v \in U_i$ labelled $(k, p)$ {
          create a process node v labelled $(k, p)$
          add v to A and to $U_i$ } } }

    for each $(p(l), k) \in T(h, a_{i+1})$ {
      if there is $(k, t) \in P_i$ such that t has label l {
        if there is no edge from t to q {
          add an edge from t to q
          if there is no node in $U_i$ with label $(l, q)$ {
            if there is $w \in U_i \setminus A$ with label of the form $(h', t)$
               and $pop \in T(h', a_{i+1})$ {
              create a process node v labelled $(l, q)$
              add v to A and to $U_i$ } } } }
      else { create a node t with label l in the call graph
        make q a successor of t
        create a process node v labelled $(k, t)$
        add v to A, to $U_i$ and to $P_i$ } } } }
if $U_n$ contains a node whose label is $(h_\infty, q_0)$ where $h_\infty$ is an accept state
  of the RCA and $q_0$ is the base node of the call graph { report success }
else { report failure }

Theorem 3 Given RCA($\Gamma$) and an input string $a_1 \ldots a_n \$$, Algorithm 4.3
terminates and reports success if $a_1 \ldots a_n$ is in the language generated by $\Gamma$ and
terminates and reports failure if $a_1 \ldots a_n$ is not in the language.
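
The algorithm of Section 4.3 can be prototyped in a few lines of Python. The following is a sketch under stated assumptions rather than the authors' implementation: the action encoding (`"shift"`, `"reduce"`, `"push"`, `"pop"`), the `popped` set used to detect pops that have already been performed from a call graph node, and the table `EXAMPLE3` are all hypothetical. In particular `EXAMPLE3` is reconstructed from the traversal described for Example 3, and every lookahead set is simply taken to be $\{b, \$\}$, which may be larger than the sets in the original diagram.

```python
def recognise(table, accepting, tokens, end="$"):
    """Return True if some traversal of the RCA accepts tokens.
    table maps (state, lookahead) to a list of actions: ("shift", k),
    ("reduce", rule, k), ("push", l, k) or ("pop",)."""
    symbols = list(tokens) + [end]
    labels = [-1]            # call graph node -> label; node 0 is the base
    succ = [set()]           # call graph node -> successors (towards base)
    current = {(0, 0)}       # U_0: (RCA state, call graph node) pairs
    for i, a in enumerate(symbols):
        created = {(0, -1): 0} if i == 0 else {}   # P_i: (state, label) -> node
        popped = set()       # call nodes already popped during this step
        following = set()    # U_{i+1}
        todo = list(current)                       # the work list A

        def add(pair):
            if pair not in current:
                current.add(pair)
                todo.append(pair)

        while todo:
            h, q = todo.pop()
            actions = table.get((h, a), [])
            for act in actions:
                if act[0] == "shift":
                    following.add((act[1], q))
                elif act[0] == "reduce":
                    add((act[2], q))
            if ("pop",) in actions:
                popped.add(q)
                for p in list(succ[q]):
                    add((labels[q], p))
            for act in actions:
                if act[0] != "push":
                    continue
                _, l, k = act
                t = created.get((k, l))
                if t is None:                 # no reusable node: create one
                    labels.append(l)
                    succ.append({q})
                    t = len(labels) - 1
                    created[(k, l)] = t
                    add((k, t))
                elif q not in succ[t]:        # reuse t and add a new edge
                    succ[t].add(q)
                    if t in popped:           # apply the pop down the new edge
                        add((l, q))
        if i + 1 < len(symbols):
            current = following
    return any(h in accepting and q == 0 for h, q in current)


# Assumed table for Example 3 (state 1 is accepting and carries the pop).
EXAMPLE3 = {(3, "b"): [("shift", 4)]}
for x in ("b", "$"):
    EXAMPLE3[(0, x)] = [("reduce", 2, 1)]
    EXAMPLE3[(1, x)] = [("push", 2, 0), ("pop",)]
    EXAMPLE3[(2, x)] = [("push", 3, 0)]
    EXAMPLE3[(4, x)] = [("reduce", 1, 1)]
```

Each (state, node) pair is added at most once per step and each call graph edge at most once, which is what keeps the loop finite even though the grammar of Example 3 has hidden left recursion.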

5 Conclusion 

The techniques described in this paper form a two-level generalisation of the 
finite automata that underlie parsing. We construct an FA (the Reduction In- 
corporated Automaton) which recognises the regular parts of a language and 
use this as a basis for another FA (the Recursive Call Automaton) which uses a 
stack when we need to recognise recursive parts of the language. 

The RCA is nondeterministic, and so we need to manage multiple stacks. 
Tomita’s GSS is a general-purpose structure for maintaining multiple stacks 
which may share prefixes and, by virtue of there being a finite set of values 
that may appear at the top of those stacks, may also be merged. In a Tomita 
parser, merging occurs when two stacks have the same LR state on top. In the 
RCA, merging occurs when two stacks have the same RCA state on top. A key 
difference is that Tomita must maintain a complete trace of all stack activity 
because the GSS is used to manage reductions, and during a reduction the parser 
needs to be able to look down into the stacks to see which state to go to. This 
is unnecessary in the case of the RCA which only ever needs to look one level 
back to find the state it needs to go to. 

The use of a GSS-like structure, and the applicability of our approach to 
generalised parsing might create the impression that our parser is Tomita-like. 
Indeed Aycock and Horspool whose reduction transitions inspired this work de- 
scribe their algorithm as an optimisation of Tomita’s algorithm. In fact, the 
short-circuiting of the reduction path search by the reduction transitions means 
that there are few points of contact between the approaches. Tomita’s algorithm 
and the variations of it, for example those given by Farshi [Ij and Rekers m, 
are generalised LR algorithms in the sense that if they are given an LR gram- 
mar then they essentially behave like a traditional stack based LR parser. In our 
case, and that of Aycock and Horspool, the parsers behave like FAs on regular 
grammars. 

We would like to be able to produce an efficient version of what could be called 
a generalised regular parser in the sense that the algorithm is essentially an FA in 
the case where the input grammar defines a regular language. However, with the 
method described here this cannot be fully achieved because it is possible to have 
a grammar which contains non-trivial self embedding but whose language is still 
regular, for example $S ::= aSa \mid \epsilon$. For this grammar the RIA would accept, for
example, a^. Thus, following Aycock and Horspool, our algorithm has an initial 
phase which modifies the input grammar if it contains proper self-embedding and
in some cases the algorithm uses a stack even though the underlying language
is regular.

We have said little about the production of derivations from our parser since 
the scheme presented here is essentially just a recogniser. We have investigated 
two approaches: the production of Tomita-like Shared Packed Parse Forests and 
the construction of an FA whose language is essentially the set of all possible 
right-most derivations of the sentence. Space precludes a full discussion here, but 
the main issue is to ensure that production of derivations does not significantly 
compromise the efficiency gains from the reduction in stack activity. For this 
reason we prefer the second of the two approaches, see jH] for further details.

Abstract. We have developed Tm, a template-based metacompiler. Given a set
of data-structure definitions and a template, Tm generates files that instantiate the
template for the given data structures. With this process, Tm is able to generate
program code to manipulate these data structures. Since it uses templates, the
generated code is not restricted to a specific programming language: any sufficiently
powerful programming language can be targeted.

Tm has been used for a wide variety of tasks and languages. However, it was 
designed to support compiler construction, and most applications have been in 
that area. 

In this paper we outline Tm, and describe our experiences with using it to construct
a static compiler for Java. As we will show, it has significantly accelerated
implementation of the compiler. Almost 75% of its source code is generated by Tm,
allowing us to rapidly implement a much more robust and sophisticated compiler
than would have been possible otherwise.



1 Introduction 

In an earlier paper (jS] we described Tm (short for Template Manager), a template code 
generator. Given a set of data-structure definitions and a template, Tm generates an output
file that is an expansion of the template using the data structure definitions (Fig. 1).



[Diagram: source code template + data-structure definitions -> Tm -> source code]

Fig. 1. Given a source code template file and a set of data-structure definitions, Tm generates a
source code file.
For example, the following definitions could be used to represent connections between
electronic components:

connection = Wire: { name: string } | Bundle: { l: [connection] };

A connection is either a single wire or a bundle of connections, represented by types 
Wire or Bundle respectively. Both are subtypes of connection. A Wire contains a 
string field, a Bundle contains a list of connections. 

Now consider the following Tm template: 

.foreach t ${typelist}
typedef str_$t *$t;
.endforeach

The two lines starting with a dot form a Tm command that iterates over the defined types,
and assigns the current type to variable t. The remaining line is written to the output in 
each iteration. The two $t expressions are references to variable t that are substituted 
by Tm. 

Executing this template using the type definitions shown above results in: 

typedef str_Wire *Wire; 
typedef str_Bundle *Bundle;
typedef str_connection *connection; 

Tm templates are programs for the Tm macro language. When executed, these programs
can generate source code for another programming language. This process is
called metaprogramming, since it is a metalevel above ‘normal’ programming. The approach
that Tm uses is called static metacompilation or template metaprogramming since
it is done at compile time, not at runtime.

Because it uses templates, Tm is neutral with respect to the target language of the 
generated files. Various users have written templates for programming languages such as 
Miranda, Pascal, C, C++, Lisp, Clean and Java, but also for targets such as Unix shells, 
the Unix streaming editor (sed), and configuration files for various programs. The most 
common target is the C programming language. 

Tm supports file inclusion in its templates, so code can be shared between projects,
and standard templates can be provided for common code. An extensive set of standard
templates has been developed for C, and for many programs the code provided by these
templates is sufficient.

Tm has proved to be very useful in a large variety of projects. To illustrate this, we
will examine the use of Tm and its C templates in the construction of Timber [9,10],
a parallelizing Spar/Java compiler. Using Tm has had a profound impact on the implementation
of Timber. Nearly 75% of its source code is generated by Tm. Code templates
strongly encourage code reuse, since a code template section is repeatedly expanded, 
and entire code templates are re-used between projects. Moreover, code templates can 
automate a number of error-prone tasks, such as dependency calculations between node 
types. For all these reasons, using Tm has allowed us to implement a far more robust, 
powerful and adaptable compiler than otherwise would have been possible. 

The source code of the Timber compiler is available for downloading from El, the 
source code of Tm itself is available for downloading from the Tm website 0. 

The paper is organized as follows: In Section 2 we describe related work. In Sections 3
and 4 we give an overview of Tm. In Section 5 we describe the C templates of Tm. In
Section 6 we describe the Timber compiler and the impact of Tm on its implementation,
and in Section 7 we draw some conclusions.



2 Metaprogramming Languages 



In a sense, the most popular metaprogramming language is formed by the directives of 
the C/C++ preprocessor. Unfortunately, it is a very weak language lacking even simple 
features such as iteration or string manipulation. Generic macro processors such as the 
Unix tool ‘m4’ have also been used for metaprogramming, but such a macro processor 
has no knowledge about the data-structure definitions for which code must be generated.
However, for effective metaprogramming the metaprogramming language must know
the data-structure definitions of a program, and the relation between them. This allows
the metaprogram to generate code that is tailor-made for a specific data structure. 

Knowledge about the data-structure definitions can be provided at run time or at
compile time. At run time, the knowledge can be provided through a set of inquiry
functions that list the types in a program, list members of a type, etc. Such inquiry
functions are available in languages like Java, Smalltalk, and Python. This approach is
called dynamic metaprogramming. For example, Java associates a java.lang.Class
object with every object in a program. Query methods of the Class object allow the
program to list methods, constructors and fields of the class, and to obtain detailed
information about these methods.

The advantage of dynamic metaprogramming is that the same language is used 
for programming and metaprogramming, obviating the need to learn a new language. 
However, since the metaprogramming is done at run time, it is difficult to compile the
output of the metaprogram. The alternative, interpretation, results in slower execution. 
Also, dynamic metaprogramming languages are usually not designed for large-scale 
metaprogramming, so that extensive templates are cumbersome to implement. Finally, 
dynamic metaprogramming is inherently restricted to a single programming language. 

When the knowledge about the data-structure definitions is provided at compile time,
this is called static metaprogramming or static metacompilation. Knowledge about the 
data-structure definitions can be extracted from the target language, or can be provided 
as definitions in the metaprogramming language. The first approach requires tight in- 
tegration with the programming language under it. It is used, for example, in Willink’s 
Flexible Object Generator II 11121 (FOG). He replaces the standard preprocessor of C++ 
with a much more powerful metacompiler integrated with C++. Unfortunately, although 
FOG can access C++ class definitions, it does not allow computations on the relations be- 
tween classes. This is a restriction of FOG, not one that is inherent to the used approach. 
The approach is inherently restricted to a single programming language. 

It is also possible to construct a static metacompiler that is independent of the under- 
lying programming language. With this approach, the data-structure definitions are part 
of the metalanguage, and metaprograms must generate data-structure definitions for the 
target language. This makes the metacompiler fully independent of the target language. 
This is the approach used by Tm, although we strictly separate the macrolanguage and 
the data-structure definition language. 
The same approach is used by AutoGen m . It shares many features with Tm, but was
designed for general code construction tasks. In contrast, Tm was designed to generate
manipulation code for data structures, and in particular to assist in compiler construction.
AutoGen's macro language does not have Tm's rich set of functions and commands to
access and manipulate data-structure definitions. Also, it lacks the rich set of C templates
that Tm provides.

3 Tm Data-Structure Definitions 

A Tm data-structure definition file, such as the one shown in Fig. 2, consists of a series
of definitions of Tm types. A Tm type is either a class or a tuple. For example, in Fig. 2
origin is a tuple type, all others are class types. Both classes and tuples can contain an
arbitrary number of fields.



expr = { org: origin } +
    VarExpr: { nm: string } |
    AddExpr: { l: expr, r: expr } |
    SubExpr: { l: expr, r: expr } |
    NegExpr: { x: expr } |
    ConstExpr: { n: int } |
    CallExpr: { fn: string, parms: [expr] };

origin == ( file: string, line: int );

Fig. 2. A typical set of Tm type definitions to represent expressions in a programming language.



3.1 Fields 

Each field of a tuple or class consists of a name and a type. The type can be either a 
simple type, written as the name of the type; or a list type, written by surrounding a type 
with a square bracket pair (‘ [’ and ‘] ’). List types denote lists of arbitrary length, whose 
length can change at run-time. For example, the following are all valid fields: 

line: int        file: string

points: [point]  words: [[char]]



3.2 Class Types 

In its simplest form, a class type consists of a list of fields separated by commas, and
surrounded by curly braces. Like all type definitions, it must be terminated by a semicolon
(‘;’). For example:

origin = { file: string, line: int };

A class can also inherit from other types. For example:

ifStatement = statement + { cond: expr, then: block, else: block };

means that the ifStatement class inherits the fields of the statement class.

A class can be defined to be virtual by using the ‘~=’ operator instead of the ‘=’
operator. This indicates that the class itself will never be created, only subclasses of this
class. For example:

statement ~= { org: origin };



To allow compact and clear specification of a class with many subclasses, subclasses 
can be specified in the class itself. For example: 

statement = { org: origin } +
    ifStatement: { cond: expr, then: block, else: block } |
    whileStatement: { cond: expr, body: block } |
    forStatement: { var: string, bound: expr, body: block } |
    assignStatement: { lhs: expr, rhs: expr };

Every labeled component is called an alternative; every alternative defines a subclass
with the name of its label. A class containing alternatives is always virtual. Thus, the
definition above is equivalent to:

statement ~= { org: origin };
ifStatement = statement + { cond: expr, then: block, else: block };
whileStatement = statement + { cond: expr, body: block };
forStatement = statement + { var: string, bound: expr, body: block };
assignStatement = statement + { lhs: expr, rhs: expr };



3.3 Tuple Types 

A tuple consists of a list of fields separated by commas and surrounded with parentheses. 
Like all type definitions, it must be terminated by a semicolon (‘;’). For example:

origin == ( file: string, line:int ); 



The ‘==’ operator introduces a tuple type. 

A tuple can inherit from other types. For example, the following tuple inherits from 
statement: 



ifStatement == statement + ( cond:expr, then:block, else:block ); 

A tuple statement cannot contain alternatives or multiple lists of fields. 

A tuple type can always be converted to an equivalent class type; tuples are provided 
for compactness and efficiency. 



3.4 Restrictions 

A number of restrictions are enforced on the type definitions: 

- A type cannot have the same name as a previously defined type.

- A type cannot, directly or indirectly, inherit from itself.

- A type cannot, directly or indirectly, inherit the same type twice.

- A type cannot have two fields with the same name, or inherit a field with the same
name as one of its own fields.
4 The Tm Template Language

The Tm template language is an untyped interpreted programming language to manipulate
Tm type definitions and text. It is powerful enough to generate code for arbitrary
programming languages, and for metalevel computations such as generating sequence
numbers, calculating the dependencies between types, and calculating the transitive closure
of these dependencies.

In Tm templates, all lines starting with a dot are commands. Lines that do 
not start with a dot are copied to the output. In both command lines and output lines, 
expressions starting with a $ are expanded. Expressions of the form $() denote variable
references, expressions of the form $[] denote arithmetic expressions, expressions of the
form ${} denote function invocations, and all other expressions of the form $<letter>
denote variable references to the variable <letter>. For example, the template:

.set n 4
.set words for while goto
int br[$n,${len $(words)}];
int ht[$[$n*${len $(words)}]];

will produce:

int br[4,3];
int ht[12];

The function len calculates the length of the list it is given, in this case the list assigned
to variable words. The $[] expression in the declaration of ht multiplies the calculated
length by n.

There are also functions to list the defined types, list the field names of a given type, 
retrieve the type of a given field, manipulate strings, etc. There are also commands to 
include files, define macros, etc. For further details see Q. 



5 The Tm C Templates 

As part of the core Tm distribution a number of templates for the C programming language
are provided. These templates have been used in a large range of programs, including Tm
itself and the Timber compiler described below. It is useful to distinguish three different
types of template: administration templates, which generate code for general-purpose
administration of types; tree walker templates, which generate code to visit particular
nodes in a tree; and analysis templates, which generate code to traverse a tree and collect
information about the nodes in the tree.

For example, using the type definitions of Fig. 2, consider the following template:

.set wantdefs rdup_origin
.set basename demo
.include tmc.ct

The variable wantdefs is set to the list of functions that should be generated. In this
case only the function rdup_origin is requested. The last line includes the standard
administration template file tmc.ct. The code in this file will generate the requested
function.

From this template, Tm will generate a function rdup_origin that creates a duplicate
of an origin instance. The C templates automatically generate other functions when
they are necessary to implement the requested functions. In this case, the template will
also generate a function new_origin that, given a string and an integer, creates a new
instance of origin.



5.1 Administration 

The C administration code templates can generate code to: 

- Create and destroy instances of the defined types. 

- Read and write an ASCII representation of instances of these types. 

- Compare two instances. 

- Manipulate lists: append to, insert in, delete from, reverse, concatenate. 

- Duplicate type instances. 

For example, to create new instances of the types of Fig. 2, the following functions
can be generated:

origin new_origin( int line, string file );
expr new_VarExpr( origin org, string nm );
expr new_AddExpr( origin org, expr l, expr r );
expr new_CallExpr( origin org, string fn, expr_list parms );
expr_list new_expr_list();

To recursively free instances of these types, the following functions can be generated:

void rfre_origin( origin e );
void rfre_expr( expr e );
void rfre_expr_list( expr_list l );

As explained above, the C templates automatically generate other functions if they are 
necessary to implement the requested functions. 



5.2 Tree Walkers 

It is often necessary to traverse (‘walk’) a tree, and visit all nodes of a specific type. For
example, in the types of Fig. 2, we might want to visit all NegExpr nodes containing
a ConstExpr, and replace them with a new ConstExpr. The action to be performed
on each node must be written by the user. However, code is also needed to traverse the
tree and ensure that all instances of the target nodes are visited, and Tm can take care of
that. Appendix A shows a tree walker to implement our example; here we only briefly
describe its requirements and features.

The tree walker template requires the following from the programmer: 

- A list of node types to start the walk from, and a list of node types to visit. 

- Action functions for all node types that must be visited. 

- Macros for generating signatures and invocations of the walker functions. 
From this information Tm computes the set of nodes to walk, and generates appropriate
walker functions. The action functions provided by the user are copied to the output file,
and together they form a complete tree walker.

By letting the user specify the signature of the walker and action functions, the 
tree walkers are flexible enough to pass arbitrary information into the tree walk, and to 
accumulate arbitrary information during the tree walk. 

Using a tree walker has the usual advantages of code templates: extensive code re-use.
Moreover, the tree-walker template automates the calculation of the required traversal.
Since that is an error-prone task that must be repeated after every change or addition to
the data structures, automation greatly improves the reliability of the traversal code.

A tree walker is similar in concept to the visitor pattern that has been proposed as 
a design pattern for object-oriented programming E). In both cases we wish to apply 
operations on a set of node types in a tree. The visitor pattern is implemented by adding 
a method to all node types. These methods implement a walk over the entire tree. During 
the walk, nodes are passed to a visitor method that applies the appropriate method for 
that type of node. A different type of walk over the tree only requires the definition of a 
different visitor method. 

Although the visitor pattern has some of the advantages of a Tm template, it also 
has a number of drawbacks. In particular, it is still necessary to implement the (generic) 
tree walk by hand. Moreover, the visitor methods are often complicated since the correct 
action for every type of node must be determined and executed. Finally, the entire tree 
is always visited, even if a particular walk does not require it. 

In contrast, for a Tm tree walker all tree traversal and type inspection code is gener- 
ated; the user only needs to supply the code for the operations on the visited types. 

5.3 Analyzers 

One specific type of tree walker is used to collect information about a tree. For example, 
we might want to estimate the size of the generated code, determine whether an expres- 
sion has side-effects, or collect the variables that are used in a code fragment. We call 
such tree walkers analyzers, and we provide a specialized template to generate them. 
An analyzer must not modify the tree it walks, and its operation must be a reduction 
operation. Typical reduction operators are boolean and and or, summation (for example 
to calculate the estimated size of a code fragment), and list concatenation (for example 
to collect all variable names in a code fragment). 

The analyzer template requires the following from the programmer: 

- A list of node types to start the walk from, and a list of node types to visit. 

- For all the node types to visit, a classification of the node. The method can be ignore 
(do not visit this node), reduction (the value is the reduction of the values of its 
fields, possibly combined with a given constant), constant (the value is the given 
constant), or function (the value is computed by a user-supplied function). 

- The type of the analysis result (e.g. int). 

- The reduction operator to apply (e.g. addition). 

- The neutral element of the reduction (e.g. 0). 

- A macro to generate walker function signatures. 

- Optionally, a termination test expression. 
From this information the set of nodes to walk is computed, and appropriate walker
functions are generated. The termination test expression allows useless tree walks to be
cut off. For example, once the intermediate result of a boolean and reduction is false,
the traversal can stop, since the result will always be false.



6 Application of Tm in the Timber Compiler 

Tm and its C templates are used extensively in the Timber compiler [9,10], a static
compiler for a superset of Java d. To illustrate the usefulness of Tm we will describe 
the impact that the use of Tm has had on the compiler. 

Internally, the Timber compiler consists of three modules (Fig. 3): a frontend that
translates Spar/Java to an intermediate representation called Vnus 12I8H . a number of 
parallelization engines that rewrite Vnus, and a backend that translates Vnus to C++ 
code. 




[Diagram: Spar/Java -> frontend -> Vnus -> parallelization engines -> Vnus -> backend -> C++]

Fig. 3. Data flow in the Timber compiler.



To give an indication of the amount of work Tm has saved us, we will show statistics
comparing the number of lines of hand-written and generated code. (This comparison is
meaningful because the style of the code generated by our Tm templates is similar to
what we write ourselves.) We calculate the amount of generated code by counting the lines
in the generated source files, and subtracting the number of lines in the template file.
For the amount of hand-written code we count the lines in the non-generated source files,
and in the template files.

                      lines       %
handwritten         119,555   26.3%
Tm administration   175,744   38.6%
Tm tree walkers     144,643   31.8%
Tm analyzers         10,698    2.4%
yacc                  4,533    1.0%

Fig. 4. Code origin for the entire Timber compiler.

We assign each line of code that is passed to the C compiler to one of the following
five categories: hand-written, generated by yacc, or generated by a Tm administration,
tree walker or analysis template.

Figure 4 shows the statistics for the entire compiler. In subsequent sections we will
show the statistics for the individual compiler phases.

As these figures show, nearly 75% of the compiler code is generated by Tm. Roughly
half of the generated code is for administration, and the other half implements tree walkers.
Only a small fraction of the generated code is devoted to analyzer tree walkers. One
reason for this is that analyzer tree walkers are a fairly recent development; some analysis
operations are still done in hand-written code, even though in a new implementation
an analyzer tree walker would be used.

The Timber compiler has taken an estimated five person-years to implement: three 
person-years to implement a static Java compiler, and two to implement the language 
extensions and the parallelization engines. The resulting compiler is able to compile 
large programs and large parts of the standard Java library to efficient executables. 

6.1 Communication between Engines 

The Timber compiler consists of independent programs, called engines, that are ‘glued’ 
together with a shell script. Internally, each engine represents the program as a tree 
of Tm types. Communication between the compiler engines is implemented using Tm- 
generated functions. These functions print a tree to a textual representation in a file, and 
convert this textual representation back into a tree. 

6.2 The Spar/Java Frontend 

Figure 5 shows the code generation statistics for the Spar/Java frontend.

                      lines       %
handwritten          50,718   34.1%
Tm administration    44,640   30.0%
Tm tree walkers      41,557   27.9%
Tm analyzers          9,372    6.3%
yacc                  2,554    1.7%

Fig. 5. Origin of the frontend code.

The frontend parses Spar/Java, applies the semantic checks required by Java on the
program, applies a number of optimizations, and generates Vnus. A number of tree
walkers implement distinct compiler phases. In order of their application they do the
following:

- Rewrite some constructs to simplify the remaining phases.

- Register class declarations in the symbol table. 

- Register methods and constructors in the symbol table. 

- Bind variables, types, methods and constructors. 

- Check correctness of the program. 

- Apply a number of code optimizations, e.g. inlining, constant folding.

- Add garbage-collection administration code. 

- Eliminate unused variable declarations. 

Other tree walkers implement auxiliary operations that work on fragments of code instead 
of an entire program. They do the following: 

- Mark variables that are only read as ‘final’.

- Rename variable references (used in method inlining). 

- List the scope names of a code fragment. 

- List variables that are not bound in the given code fragment. 

- Do constant folding on an expression. 

- List the assigned variables of a code fragment. 

- Update the use count of the methods used in a code fragment. 

- Rewrite ‘return’ statements to ‘goto’ statements (used in method inlining). 

A number of analyzer tree walkers are also used, which do the following: 

- Estimate the size of a given code fragment. 

- Determine whether an expression is constant.

- Determine whether an expression requires the garbage-collection administration to
be up-to-date.

- Determine whether a code fragment alters the state of the garbage-collection
administration.

- Determine whether an expression has side effects.

- Determine whether an expression evaluates to zero.



6.3 The Parallelization Engines 

Figure 6 shows the code generation statistics for the parallelization engines.

The parallelization engines transform implicitly parallel Vnus programs (sequential 
programs with parallelization annotations) to explicitly parallel Vnus programs. The 
engines are implemented as a set of 57 rules that each apply a simple rewrite operation 
on the Vnus program. These rules are implemented as tree walkers. Some example rules 
are: 



- Search for loops that only contain a communication statement for a single element, 
and replace them by code that communicates all elements in a single message. 

- Exchange loops in a loop nest when this is profitable. 

- Simplify if statements with a constant true or false condition. 

                      lines       %
handwritten          18,534    9.4%
Tm administration    85,536   43.3%
Tm tree walkers      93,680   47.4%

Fig. 6. Origin of the parallelization code.

                      lines       %
handwritten          50,303   46.3%
Tm administration    45,568   42.0%
Tm tree walkers       9,406    8.7%
Tm analyzers          1,326    1.2%
yacc                  1,979    1.8%

Fig. 7. Origin of the backend code.







6.4 The Vnus Backend 

Figure 7 shows the code generation statistics for the Vnus backend.

The backend translates Vnus code to C++ code. Similar to the frontend, a number 
of tree walkers implement distinct compilation phases (checking, optimization), and a 
number of other tree walkers serve as auxiliary functions (constant folding, tests, etc.). 



7 Conclusions 

Our template-based metacompiler Tm is able to generate an extensive range of functions 
to manipulate data structures. Since it uses templates, the generated code is not restricted 
to a specific programming language. 

Since Tm provides a full programming language for template implementation, it is
possible to write highly sophisticated templates, for example the tree walker templates
described in Section 5.2 and the analyzers described in Section 5.3.

As we have shown, the use of Tm has had a profound impact on the implementation 
of Timber, our Spar/Java compiler. Nearly 75% of the source code of the compiler is 
generated by Tm, allowing rapid implementation of the compiler, and resulting in a 
much more robust and sophisticated compiler than would have been possible otherwise. 
Consequently, in three person-years we have been able to implement a Java compiler that 
is able to correctly compile large parts of the standard library to efficient executables. 
Our extensions to Java, and the parallelization engines were implemented in two person- 
years. 
A Example Tree Walker

Given the data structures of Fig. 2, the following tree walker template generates code
to rewrite all NegExpr instances containing a ConstExpr to a new constant expression.
To do this, an action function fold_NegExpr_action is defined. See the comments in
the template and the generated code for further explanation.

.macro generate_walker_declaration v t
static $t fold_$t_walker( $t $v );
.endmacro

.macro generate_walker_signature v t
static $t fold_$t_walker( $t $v )
.endmacro

.. Given an indent, an expression, its real type and its
.. perceived type, generate invocation of an action.
.macro generate_action_call i x t n
.if ${eq $t $n}
$i$v = ($t) fold_$t_action( $x );
.else
$i$v = ($t) fold_$t_action( ($t) $x );
.endif
.endmacro

.. Given an indent, an expression, its real type and its
.. perceived type, generate invocation of a walker.
.macro generate_walker_call i v t n
.if ${eq $t $n}
$i$v = ($t) fold_$t_walker( $v );
.else
$i$v = ($t) fold_$t_walker( ($t) $v );
.endif
.endmacro

.. If type t has an action, invoke it, else invoke its walker.
.macro generate_descent_call i v t n
.if ${member $t $(actors)}
.call generate_action_call "$i" "$v" "$t" "$n"
.else
.call generate_walker_call "$i" "$v" "$t" "$n"
.endif
.endmacro

.set actors NegExpr

.. Insert the macros required for tree walking.
.insert tmcwalk.t

.. Calculate which types must be visited.
.set visit_types ${call calc_treewalk "expr" "$(actors)"}

.. Generate forward declarations for the walker functions.
.call generate_walker_forwards "$(visit_types)"

static expr fold_NegExpr_action( NegExpr x )
{
.call generate_walker_call "    " x NegExpr NegExpr
    if( x->x->tag == TAGConstExpr ){
        ConstExpr res = (ConstExpr) (x->x);
        x->x = exprNIL;
        rfre_expr( x );
        res->n = -res->n;
        return (expr) res;
    }
    return x;
}

.. Generate the walker functions.
.call generate_walker "$(visit_types)"

When this template is executed, the following code is generated: 

/♦ Generated forward declarations start here ♦/ 

/♦ Forward declarations. ♦/ 

static AddExpr f old_AddExpr_walker ( AddExpr e ); 
static SubExpr f old_SubExpr_walker ( SubExpr e ) ; 
static NegExpr f old_NegExpr_walker ( NegExpr e ) ; 
static expr_list f old_expr_list_walker ( expr_list e ); 
static CallExpr f old_CallExpr_walker ( CallExpr e ); 
static expr f old_expr_walker ( expr e ); 

/♦ Generated forward declarations end here ♦/ 

static expr f old_NegExpr_action( NegExpr x ) 

X = (NegExpr) f old_NegExpr_walker ( x ); 
if( x->x->tag == TAGConstExpr ){ 

ConstExpr res = (ConstExpr) (x->x) ; 

x->x = exprNIL; 

rfre_expr( x ); 

res->n = -res->n; 

return (expr) res; 

} 

return x; 

} 

/* Generated code starts here */

/* Walker for class AddExpr. */
static AddExpr fold_AddExpr_walker( AddExpr e )
{
    e->l = (expr) fold_expr_walker( e->l );
    e->r = (expr) fold_expr_walker( e->r );
    return e;
}

/* Walker for class SubExpr. */
static SubExpr fold_SubExpr_walker( SubExpr e )
{
    e->l = (expr) fold_expr_walker( e->l );
    e->r = (expr) fold_expr_walker( e->r );
    return e;
}

/* Walker for class NegExpr. */
static NegExpr fold_NegExpr_walker( NegExpr e )
{
    e->x = (expr) fold_expr_walker( e->x );
    return e;
}

/* Walker for list expr_list. */
static expr_list fold_expr_list_walker( expr_list e )
{
    {
        unsigned int ix;

        for( ix=0; ix<e->sz; ix++ ){
            e->arr[ix] = (expr) fold_expr_walker( e->arr[ix] );
        }
    }
    return e;
}

/* Walker for class CallExpr. */
static CallExpr fold_CallExpr_walker( CallExpr e )
{
    e->parms = (expr_list) fold_expr_list_walker( e->parms );
    return e;
}

/* Walker for class expr. */
static expr fold_expr_walker( expr e )
{
    switch( e->tag ){
        case TAGAddExpr:
            e = (AddExpr) fold_AddExpr_walker( (AddExpr) e );
            break;

        case TAGSubExpr:
            e = (SubExpr) fold_SubExpr_walker( (SubExpr) e );
            break;

        case TAGNegExpr:
            e = (NegExpr) fold_NegExpr_action( (NegExpr) e );
            break;

        case TAGCallExpr:
            e = (CallExpr) fold_CallExpr_walker( (CallExpr) e );
            break;

        default:
            break;
    }
    return e;
}

/* Generated code ends here */
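To see the effect of the NegExpr action in isolation, the following is a minimal hand-written sketch of the same bottom-up constant folding. It does not use the Tm-generated types: the single `struct expr_s`, the helpers `new_expr` and `free_expr`, and the function `demo_fold` are hypothetical stand-ins introduced only for this illustration, and a `NULL` operand plays the role of `exprNIL`.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the generated node classes (assumption). */
typedef enum { TAGConstExpr, TAGNegExpr, TAGAddExpr } expr_tag;

typedef struct expr_s *expr;
struct expr_s {
    expr_tag tag;
    int n;   /* value, used by TAGConstExpr only */
    expr l;  /* left operand of TAGAddExpr, sole operand of TAGNegExpr */
    expr r;  /* right operand of TAGAddExpr */
};

/* Allocate one node; aborts when out of memory. */
static expr new_expr( expr_tag tag, int n, expr l, expr r )
{
    expr e = malloc( sizeof *e );
    if( e == NULL ) abort();
    e->tag = tag; e->n = n; e->l = l; e->r = r;
    return e;
}

/* Recursively free a tree; NULL operands are skipped. */
static void free_expr( expr e )
{
    if( e == NULL ) return;
    free_expr( e->l );
    free_expr( e->r );
    free( e );
}

/* Walk the operands first, then replace -(constant) by a constant. */
static expr fold_expr( expr e )
{
    if( e == NULL ) return e;
    e->l = fold_expr( e->l );
    e->r = fold_expr( e->r );
    if( e->tag == TAGNegExpr && e->l->tag == TAGConstExpr ){
        expr res = e->l;
        e->l = NULL;        /* detach the operand before freeing the node */
        free_expr( e );
        res->n = -res->n;
        return res;
    }
    return e;
}

/* Fold (-(-5)) + 3 and report whether the left operand became the constant 5. */
static int demo_fold( void )
{
    expr inner = new_expr( TAGNegExpr, 0, new_expr( TAGConstExpr, 5, NULL, NULL ), NULL );
    expr outer = new_expr( TAGNegExpr, 0, inner, NULL );
    expr t = new_expr( TAGAddExpr, 0, outer, new_expr( TAGConstExpr, 3, NULL, NULL ) );
    int ok;

    t = fold_expr( t );
    ok = t->tag == TAGAddExpr && t->l->tag == TAGConstExpr && t->l->n == 5;
    free_expr( t );
    return ok;
}
```

As in the generated code, the operand is detached from the negation node before that node is freed, so the constant survives and can be returned in its place.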
Abstract. I propose a set of criteria which distinguish a grand challenge in 
science or engineering from the many other kinds of short-term or long-term 
research problems that engage the interest of scientists and engineers. As an 
example drawn from Computer Science, I revive an old challenge: the 
construction and application of a verifying compiler that guarantees correctness 
of a program before running it. 



1 Introduction 

The primary purpose of the formulation and promulgation of a grand challenge is to 
contribute to the advancement of some branch of science or engineering. A grand 
challenge represents a commitment by a significant section of the research community 
to work together towards a common goal, agreed to be valuable and achievable by a 
team effort within a predicted timescale. The challenge is formulated by the 
researchers themselves as a focus for the research that they wish to pursue in any 
case, and which they believe can be pursued more effectively by advance planning 
and co-ordination. Unlike other common kinds of research initiative, a grand 
challenge should not be triggered by hope of short-term economic, commercial, 
medical, military or social benefits; and its initiation should not wait for political 
promotion or for prior allocation of special funding. The goals of the challenge should 
be purely scientific goals of the advancement of skill and of knowledge. It should 
appeal not only to the curiosity of scientists and to the ambition of engineers; ideally 
it should appeal also to the imagination of the general public; thereby it may enlarge 
the general understanding and appreciation of science, and attract new entrants to a 
rewarding career in scientific research. 

An opportunity for a grand challenge arises only rarely in the history of any 
particular branch of science. It occurs when that branch of study first reaches an 
adequate level of maturity to predict the long-term direction of its future progress, and 
to plan a project to pursue that direction on an international scale. Much of the work 
required to achieve the challenge may be of a routine nature. Many scientists will 
prefer not to be involved in the co-operation and co-ordination involved in a grand 
challenge. They realize that most scientific advances, and nearly all break-throughs, 
are accomplished by individuals or small teams, working competitively and in relative 
isolation. They value their privilege of pursuing bright ideas in new directions at short 
notice. It is for these reasons that a grand challenge should always be a minority 
interest among scientists; and the greater part of the research effort in any branch of 
science should remain free of involvement in grand challenges. 

A grand challenge may involve as much as a thousand man-years of research 
effort, drawn from many countries and spread over ten years or more. The research 
skill, experience, motivation and originality that it will absorb are qualities even 
scarcer and more valuable than the funds that may be allocated to it. For this reason, a 
proposed grand challenge should be subjected to assessment by the most rigorous 
criteria before its general promotion and wide-spread adoption. These criteria include 
all those proposed by Jim Gray [1] as desirable attributes of a long-range research 
goal. The additional criteria that are proposed here relate to the maturity of the 
scientific discipline and the feasibility of the project. In the following list, the earlier 
criteria emphasize the significance of the goals, and the later criteria relate to the 
feasibility of the project, and the maturity of the state of the art. 

• Fundamental. It arises from scientific curiosity about the foundation, the nature, 
and the limits of an entire scientific discipline, or a significant branch of it. 

• Astonishing. It gives scope for engineering ambition to build something useful 
that was earlier thought impractical, thus turning science fiction to science fact. 

• Testable. It has a clear measure of success or failure at the end of the project; 
ideally, there should be criteria to assess progress at intermediate stages too. 

• Inspiring. It has enthusiastic support from (almost) the entire research 
community, even those who do not participate in it, and do not benefit from it. 

• Understandable. It is generally comprehensible, and captures the imagination of 
the general public, as well as the esteem of scientists in other disciplines. 

• Useful. The understanding and knowledge gained in completion of the project 
bring scientific or other benefits; some of these should be attainable, even if the 
project as a whole fails in its primary goal. 

• Historical. The prestigious challenges are those which were formulated long ago; 
without concerted effort, they would be likely to stand for many years to come. 

• International. It has international scope, exploiting the skills and experience of 
the best research groups in the world. The cost and the prestige of the project are 
shared among many nations, and the benefits are shared among all. 

• Revolutionary. Success of the project will lead to a radical paradigm shift in 
scientific research or engineering practice. It offers a rare opportunity to break 
free from the dead hand of legacy. 

• Research-directed. The project can be forwarded by the reasonably well 
understood methods of academic research. It tackles goals that will not be 
achieved solely by commercially motivated evolution of existing products. 

• Challenging. It goes beyond what is known initially to be possible, and requires 
development of understanding, techniques and tools unknown at the start. 

• Feasible. The reasons for previous failure to meet the challenge are well 
understood and there are good reasons to believe that they can now be overcome. 

• Incremental. It decomposes into identified intermediate research goals, which 
can be shared among many separate teams over a long time-scale. 

• Co-operative. It calls for planned co-operation among identified research teams 
and research communities with differing specialized skills. 
• Competitive. It encourages and benefits from competition among individuals 
and teams pursuing alternative lines of enquiry; there should be clear criteria 
announced in advance to decide who is winning, or who has won. 

• Effective. Its promulgation changes the attitudes and activities of research 
scientists and engineers. 

• Risk-managed. The risks of failure are identified, symptoms of failure will be 
recognized early, and strategies for cancellation or recovery are in place. 

The tradition of grand challenges is common in many branches of science. If you 
want to know whether a challenge qualifies for the title ‘Grand’, compare it with 

- Prove Fermat’s last theorem (accomplished) 

- Put a man on the moon within ten years (accomplished) 

- Cure cancer within ten years (failed in 1970s) 

- Map the Human Genome (accomplished) 

- Map the Human Proteome (too difficult for now) 

- Find the Higgs boson (under investigation) 

- Find Gravity waves (under investigation) 

- Unify the four forces of Physics (under investigation) 

- Hilbert’s programme for mathematical foundations (abandoned in 1930s) 

All of these challenges satisfy many of the criteria listed above in varying degrees, 
though no individual challenge could be expected to satisfy all the criteria. The first in 
the list was the oldest and in some ways the grandest challenge; but being a 
mathematical challenge, my suggested criteria are considerably less relevant for it. 

In Computer Science, the following examples may be familiar from the past. That 
is the reason why they are listed here, not as recommendations, but just as examples. 

- Prove that P is not equal to NP (open) 

- The Turing test (outstanding) 

- The verifying compiler (abandoned in 1970s) 

- A championship chess program (completed) 

- A GO program at professional standard (too difficult) 

- Automatic translation from Russian to English (failed in 1960s) 



The first of these challenges is of the mathematical kind. It may seem to be quite 
easy to extend this list with new challenges. The difficult part is to find a challenge 
that passes the tests for maturity and feasibility. The remainder of this contribution 
picks just one of the challenges, and subjects it to detailed evaluation according to the 
seventeen criteria. 



2 The Verifying Compiler: Implementation and Application 

A verifying compiler [2] uses automated mathematical and logical reasoning methods 
to check the correctness of the programs that it compiles. The criterion of correctness 
is specified by types, assertions, and other redundant annotations that are associated 
with the code of the program, often inferred automatically, and increasingly often 
supplied by the original programmer. The compiler will work in combination with 
other program development and testing tools, to achieve any desired degree of 
confidence in the structural soundness of the system and the total correctness of its 
more critical components. The only limit to its use will be set by an evaluation of the 
cost and benefits of accurate and complete formalization of the criterion of 
correctness for the software. 
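As a small illustration of what such annotations can look like in a legacy language, the sketch below states the precondition, loop invariant and postcondition of a bounded string copy as C assertions. The function `bounded_copy` is a hypothetical example, not taken from any cited tool, and the standard `assert` macro only checks these conditions at run time; the aim of a verifying compiler would be to prove them before the program runs.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Copy src into dst, truncating so that at most dst_size bytes
 * (including the terminating '\0') are written.
 * Returns the number of characters copied, excluding the terminator.
 * Hypothetical example function (assumption, for illustration only).
 */
static size_t bounded_copy( char *dst, size_t dst_size, const char *src )
{
    size_t i = 0;

    /* Preconditions: valid pointers and room for at least the terminator. */
    assert( dst != NULL && src != NULL );
    assert( dst_size > 0 );

    while( i + 1 < dst_size && src[i] != '\0' ){
        /* Invariant: the index written to is inside the buffer. */
        assert( i < dst_size );
        dst[i] = src[i];
        i++;
    }

    /* Postcondition: the terminator is also written inside the buffer. */
    assert( i < dst_size );
    dst[i] = '\0';
    return i;
}
```

Copying "hello" into a four-byte buffer, for instance, stores "hel" plus the terminator and returns 3; no write can fall outside the buffer as long as the stated preconditions hold.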

An important and integral part of the project proposal is to evaluate the capabilities 
and performance of the verifying compiler by application to a representative selection 
of legacy code, chiefly from open sources. This will give confidence that the 
engineering compromises that are necessary in such an ambitious project have not 
damaged its ability to deal with real programs written by real programmers. It is only 
after this demonstration of capability that programmers working on new projects will 
gain the confidence to exploit verification technology in new projects. 

Note that the verifying compiler does not itself have to be verified. It is 
adequate to rely on the normal engineering judgment that errors in a user program are 
unlikely to be compensated by errors in the compiler. Verification of a verifying 
compiler is a specialized task, forming a suitable topic for a separate grand challenge. 

This proposed grand challenge is now evaluated under the seventeen headings 
listed in the introduction. 

Fundamental. Correctness of computer programs is the fundamental concern of the 
theory of programming and of its application in large-scale software engineering. The 
limits of application of the theory need to be explored and extended. The project is 
self-contained within Computer Science, since it constructs a computer program to 
solve a problem that arises only from the very existence of computer programs. 

Astonishing. Most of the general public, and even many programmers, are unaware 
of the possibility that computers might check the correctness of their own programs, 
and that they might do so by the same kind of logical methods that for thousands of 
years have conferred a high degree of credibility on mathematical theorems. 

Testable. If the project is successful, a verifying compiler will be available as a 
standard tool in some widely used programming productivity toolset. It will have been 
tested in verification of structural integrity and security and other desirable properties 
of millions of lines of open source software, and in more substantial verification of 
critical parts of it. This will lead to removal of thousands of errors, risks, insecurities 
and anomalies in widely used code. Proofs will be subjected to check by rival proof 
tools. The major internal and external interfaces in the software will be documented 
by assertions, to make existing components safer to use and easier to reuse [3]. The 
benefits will extend also to the evolution and enhancement of legacy code, as well as 
the design and development of new code. Eventually programmers will prefer to 
confine their use of their programming language to those features and structured 
design patterns which facilitate automatic checks of correctness [4,5]. 

Inspiring. Program verification by proof is an absolute scientific ideal, like purity of 
materials in chemistry or accuracy of measurement in mechanics. These ideals are 
pursued for their own sake, in the controlled environment of the research laboratory. 
The practicing engineer in industry has to be content to work around the impurities 
and inaccuracies that are found in the real world, and often considers laboratory 
science as unhelpful in discharging this responsibility. The value of purity and 
accuracy (just like correctness) is often not appreciated until after the scientist has 
built the tools that make them achievable. 

Understandable. All computer users have been annoyed by bugs in mass market 
software, and will welcome their reduction or elimination. Recent well-known viruses 
have been widely reported in the press, and have been estimated to cost billions of 
dollars. Fear of cyber-terrorism is quite widespread [6,7]. Viruses can often obtain 
entry into a computer system by exploiting errors like buffer overflow, which could 
be caught quite easily by a verifying compiler [8]. 

Trustworthy software is now recognised by major vendors as a primary long-term 
goal [9]. The interest of the press and the public in the project can be maintained, 
whenever dangerous anomalies are detected and removed from software that is in 
common use. 

Useful. Unreliable software is currently estimated to cost the US some sixty billion 
dollars [10]. A verifying compiler would be a valued component of the proposed 
Infrastructure for Software Testing. 

A verifying compiler may help accumulate evidence that will help to assess and 
reduce the risks of incorporation of commercial off-the-shelf software (COTS) into 
safety critical systems. The project may extend the capabilities of load-time checking 
of mobile proof-carrying code [11]. It will provide a secure foundation for the 
achievement of trustworthy software. 

The main long-term benefits of the verifying compiler will be realised most 
strongly in the development and maintenance of new code, specified, designed and 
tested with its aid. Perhaps we can look forward to the day when normal commercial 
software will be delivered with an eighty percent chance that it never needs recall or 
correction by service packs, etc. within the first ten years after delivery. Then the 
suppliers of commercial and mass-market software will have the confidence to give 
the normal assurances of fitness for purpose that are now required by law for most 
other consumer products. 

Historical. The idea of using assertions to check a large routine is due to Turing [12]. 
The idea of the computer checking the correctness of its own programs was put 
forward by McCarthy [13]. The two ideas were brought together in the verifying 
compiler by Floyd [14]. Early attempts to implement the idea [15] were severely 
inhibited by the difficulty of proof support with the machines of that day. At that time, 
the source code of widely used software was usually kept secret. It was generally 
written in assembler for a proprietary computer architecture, which was often 
withdrawn after a short interval on the market. The ephemeral nature and limited 
distribution for software written by hardware manufacturers reduced motivation for a 
major verification effort. 

Since those days, further difficulties have arisen from the complexities of modern 
software practice and modern programming languages [16]. Features such as 
concurrent programming, object orientation and inheritance, have not been designed 
with the care needed to facilitate program verification. However, the relevant 
concepts of concurrency and objects have been explored by theoreticians in the ‘clean 
room’ conditions of new experimental programming languages [17,18]. In the 
implementation of a verifying compiler, the results of such pure research will have to 
be adapted, extended and combined; they must then be implemented and tested by 
application on a broad scale to legacy code expressed in legacy languages. 

International. The project will require collaboration among leading researchers in 
America, China, India, Australasia, and many countries of Europe. Some of them are 
mentioned in the Acknowledgements and the References. 

Revolutionary. At present, the most widely accepted means of raising trust levels of 
software is by massive and expensive testing. Assertions are used mainly as test 
oracles, to detect errors as close as possible to their place of occurrence [19]. 
Availability of a verifying compiler will encourage programmers to formulate 
assertions as specifications in advance of code, in the expectation that many of them 
will be verifiable by automated or semi-automated mathematical techniques. Existing 
experience of the verified development of safety-critical code [20,21] will be 
transferred to commercial software for the benefit of mass-market software products. 

Research-directed. The methods of research into program verification are well 
established in the academic research community, though they need to be scaled up to 
meet the needs of modern software construction. This is unlikely to be achieved 
solely in industry. Commercial programming tool-sets are driven necessarily by 
fashionable slogans and by the politics of standardisation. Their elegant pictorial 
representations can have multiple semantic interpretations, available for adaptation 
according to the needs and preferences of the customer. The designers of the tools are 
constrained by compatibility with legacy practices and code, and by lack of scientific 
education and understanding on the part of their customers. 

Challenging. Many of the analysis and verification tools essential to this project are 
already available, and can be applied now to legacy code [22-27]. But their use is still 
too laborious, and their improvement over a lengthy period will be necessary to 
achieve the goals of the challenge. The purpose of this grand challenge is to 
encourage larger groups to co-operate on the evolution of a small number of tools. 

Feasible. Most of the factors which have inhibited progress on practical program 
verification are no longer as severe as they were. 

1. Experience has been gained in specification and verification of moderately scaled 
systems, chiefly in the area of safety-critical and mission-critical software; but so 
far the proofs have been mainly manual [20,21]. 

2. The corpus of Open Source Software [http://sourceforge.net] is now universally 
available and used by millions, so justifying almost any effort expended on 
improvement of its quality and robustness. Although it is subject to continuous 
improvement, the pace of change is reasonably predictable. It is an important part 
of this challenge to cater for software evolution. 
3. Advances in unifying theories of programming [28] suggest that many aspects of 
correctness of concurrent and object-oriented programs can be expressed by 
assertions, supplemented by automatic or machine-assisted insertion of 
instrumentation in the form of ghost (model) variables and assignments to them. 

4. Many of the global program analyses which are needed to underpin correctness 
proofs for systems involving concurrency and pointer manipulation have now been 
developed for use in optimising compilers [29]. 

5. Theorem proving technology has made great strides in many directions. Model 
checking [30-33] is widely understood and used, particularly in hardware design. 
Decision procedures [34] are beginning to be applied to software. Proof search 
engines [35] are now well populated with libraries of application-dependent 
theorems and tactics. Finally, SAT checking [36] promises a step-function increase 
in the power of proof tools. A major remaining challenge is to find effective ways 
of combining this wide range of component technologies into a small number of 
tools, to meet the needs of program verification. 

6. Program analysis tools are now available which use a variety of techniques to 
discover relevant invariants and abstractions [37-39]. It is hoped that these 
will formalize at least the program properties relevant to its structural integrity, 
with a minimum of human intervention. 

7. Theories relevant for the correctness of concurrency are well established [40-42]; 
and theories for object orientation and pointer manipulation are under development 
[43,44]. 

Incremental. The progress of the project can be assessed by the number of lines of 
legacy code that have been verified, and the level of annotation and verification that 
has been achieved. The relevant levels of annotation are: structural integrity, partial 
functional specification, specification of total correctness. The relevant levels of 
verification are: by testing, by human proof, with machine assistance, and fully 
automatic. Most software is now at the lowest level - structural integrity verified by 
massive testing. It will be interesting to record the incremental achievement of higher 
levels by individual modules of code, and to find out how widely the higher levels are 
reasonably achievable; few modules are likely to reach the highest level of full 
verification. 

Co-operative. The work can be delegated to teams working independently on the 
annotation of code, on verification condition generation, and on the proof tools. 

1. The existing corpus of Open Source Software can easily be parcelled out to 
different teams for analysis and annotation; and the assertions can be checked by 
massive testing in advance of availability of adequate proof tools. 

2. It is now standard for a compiler to produce an abstract syntax tree from the source 
code, together with a data base of program properties. A compiler that exposes the 
syntax tree would enable many researchers to collaborate on program analysis 
algorithms, test harnesses, test case generators, verification condition generators, 
and other verification and validation tools. 

3. Modern proof tools permit extension by libraries of specialized theories [34]; these 
can be developed by many hands to meet the needs of each application. In 
particular, proof procedures can be developed that are specific to commonly used 
standard application programmer interfaces for legacy code [45]. 

Competitive. The main source of competition is likely to be between teams that 
work on different programming languages. Some laboratories may prefer to 
concentrate on older languages, starting with C and moving on to C++. Others may 
prefer to concentrate on newer languages like Java or C#. 

But even teams working on the same language and on the same tool may compete 
in achieving higher levels of verification for larger and larger modules of code. There 
will be competition to find errors in legacy code, and to be the first to obtain 
mechanical proof of the correctness of all assertions in each module of software. The 
annotated libraries of open source code will be good competition material for the 
teams constructing and applying proof tools. The proofs themselves will be subject to 
confirmation or refutation by rival proof tools. 

Effective. The promulgation of this challenge is intended to cause a shift in the 
motivations and activities of scientists and engineers in all the relevant research 
communities. They will be pioneers in the collaborative implementation and use of a 
single large experimental device, following a tradition that is well established in 
Astronomy and Physics but not yet in Computer Science. 

1. Researchers in programming theory will accept the challenge of extending proof 
technology for programs written in complex and uncongenial legacy languages. 
They will need to design program analysis algorithms to test whether actual legacy 
programs observe the constraints that make each theoretical proof technique valid. 

2. Builders of programming tools will carry out experimental implementation of the 
hypotheses originated by theorists; following practice in experimental branches of 
science, their goal is to explore the range of application of the theory to real code. 

3. Sympathetic software users will allow newly inserted assertions to be checked 
dynamically in production runs, even before the tools are available to verify them. 

4. Empirical Computer Scientists will apply tools developed by others to the analysis 
and verification of representative large-scale examples of open code. 

5. Compiler writers will support the proof goals by adapting and extending the 
program analyses currently used for optimisation of code; later they may even 
exploit for purposes of further optimization the additional redundant information 
provided with a verified program. 

6. Providers of proof tools will regard the project as a fruitful source of low-level 
conjectures needing verification, and will evolve their algorithms and libraries of 
theories to meet the needs of actual legacy software and its users. 

7. Teachers and students of the foundations of software engineering will be enthused 
to set student projects that annotate and verify a small part of a large code base, so 
contributing to the success of a world-wide project. 

Risk-managed. The main risks to the project arise from dissatisfaction of many 
academic scientists with existing legacy code and legacy languages. The low quality 
of existing software, and its low level of abstraction, may limit the benefit to be 
obtained from the annotations. Many failures of proof are not due to an error at all, 
but just to omission of a more or less obvious precondition. Many of the genuine 
errors detected may be so rare that they are not worth correcting. In other cases, 
preservation of an existing anomaly in legacy software may be essential to its 
continuing functionality. Often the details of functionality of interfaces, either with 
humans or with hardware devices, are not worth formalising in a total specification, 
because testing gives an easier but adequate assurance of serviceability. 

Legacy languages add to the risks of the project. From a logical point of view, they 
are extremely complicated, and require sophisticated analyses to ensure that they 
observe the disciplines that make abstract program verification possible. Finally, one 
must recognize that many of the problems of present-day software use are associated 
with configuration and installation management, build files, etc., where techniques of 
program verification seem unable to contribute. 

The idealistic solution to these problems is to discard legacy and start again from 
scratch. Ideals are (or should be) the prime motivating force for academic research, 
and their pursuit gives scope for many different grand challenges. One such challenge 
would involve design of a new programming language and compiler, especially 
designed to support verification; and another would involve a re-write of existing 
libraries and applications to the higher standards that are achievable by explicit 
consideration and simplification of abstract interfaces. Research on new languages 
and libraries is in itself desirable, and would assist and complement research based on 
legacy languages and software. 

Finally, it must be recognized that a verifying compiler will be only part of an 
integrated and rational tool-set for reliable software construction and evolution, based 
on sound scientific principles. Much of its use may be confined to the relatively lower 
levels of verification. It is a common fate of grand challenges that achievement of 
their direct goal turns out to be less directly significant than the stimulus that its 
pursuit has given to the progress of science and engineering. But remember, that was 
the primary purpose of the whole exercise. 
Abstract. In DSP processors, minimizing the amount of address
calculations is critical for reducing code size and improving performance
since studies of programs have shown that instructions that manipulate
address registers constitute a significant portion of the overall
instruction count (up to 55%). This work presents a compiler-based
optimization strategy to reduce the code size in embedded systems. Our
strategy maximizes the use of indirect addressing modes with
post-increment and post-decrement capabilities available in DSP
processors. These modes can be exploited by ensuring that successive
references to variables access consecutive memory locations. To achieve
this spatial locality, our approach uses both access pattern modification
(program code restructuring) and memory storage reordering (data layout
restructuring).



1 Introduction 

Address calculations play a key role in determining code quality in DSP
processors since instructions that manipulate address registers constitute
a significant portion of overall instruction count. For example, it was
found that for a set of codes from MediaBench suite (a popular benchmark
suite for embedded systems) running on Motorola’s DSP56000 processor,
nearly 55% of the instructions are used to manipulate address registers
through explicit loads and stores [15]. Consequently, optimizing address
code generation by eliminating as many explicit address register loads as
possible can result in significant improvements in code size and
performance. Note that code size improvements are very important not only
because code size directly determines the capacity of the customized
instruction memory (hence, its cost) in an embedded system, but also
because a smaller instruction memory means lower power consumption.

Address calculations in modern DSPs such as NEC 7701, Motorola DSP56000,
Analog Devices ADSP21xx, and Texas Instruments TMS320C5x are done in
address generation units (AGUs). An AGU contains a number of address
registers, the contents of which can be incremented or decremented in
parallel with the ongoing activity in the main datapath. The instruction
format for such processors allows one to encode a CPU activity and a
post-increment/decrement of an address register in a single instruction.
Thus, using post-increment/decrement operations instead of explicit
address register loads enhances on-chip parallelism (performance) and
reduces code size (as no separate instruction is necessary to update the
address register). Cintra and Araujo report that although some of the
register increment/decrement operations can be accommodated in VLIW
instruction slots, modern VLIW DSP architectures also have auto-increment
and auto-decrement modes; this is because exploiting these modes
effectively saves one instruction slot which might be used for some other
operation.

An optimizing compiler can exploit these post-increment/decrement operations
by performing computation and data transformations as well as by assigning
variables to address registers optimally. Consider the following scenario
where three scalar variables c, a, and b are to be accessed in the order c,a,b in 
a given DSP code. Also assume that the AGU in question has a single address 
register that can be post-incremented/decremented by 1 and that these three 
variables are stored in memory in the order a, b, c. The code for implementing 
this sequence of accesses uses three steps. The first step loads the address register
with the address of c (the first variable in the access sequence). To access the
variable a next, the second step loads the address of a into the address register. 
In accessing the variable a, a post-increment operation can be used to modify 
the content of the address register so that it points to b which will be accessed 
next. In the final step, the variable b is accessed. Overall, we need to perform two 
explicit address register loads. In addition to being a waste of machine cycles, 
this increases code size and thereby the instruction memory size, which is at a 
premium in many embedded designs. 
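
The scenario above can be replayed with a small simulation. The following Python sketch is illustrative only: the function name `explicit_loads` and the single-character encoding of variables are assumptions made here, and the model covers just one address register with post-increment/decrement by 1. It counts every explicit address register load, including the initial one.

```python
from typing import Sequence

def explicit_loads(accesses: Sequence[str], storage: Sequence[str]) -> int:
    """Count explicit address register loads for one register that can be
    post-incremented or post-decremented by 1 after each access."""
    address = {var: i for i, var in enumerate(storage)}
    loads = 0
    reachable = None  # addresses the register can hold without a new load
    for var in accesses:
        target = address[var]
        if reachable is None or target not in reachable:
            loads += 1  # explicit load of the address register
        # After the access, a post-modify can leave the register at
        # target - 1, target, or target + 1.
        reachable = {target - 1, target, target + 1}
    return loads

# Access order c, a, b with the variables stored as a, b, c.
print(explicit_loads("cab", "abc"))  # 2
# The same access order with the variables stored as c, a, b.
print(explicit_loads("cab", "cab"))  # 1
```

Under this model the storage order a, b, c needs two explicit loads, while storing the variables in their access order needs only the initial one.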

We can reduce this overhead of explicitly updating the address register by 
using a better choice of the order in which the variables are stored in data mem- 
ory. Instead of the storage order a, b, c in the previous scenario, we can eliminate 
one of the two address register loads if we use the storage order c, a, b. In this 
case, first, we load the address register with the address of c and post-increment 
the address register to make sure that, after the execution of the statement that 
accesses c, it will point to the next location (which contains a). Next, we access 
the variable a, and use again post-increment to make the address register point 
to the variable b. Finally, we access the variable b. This problem of determin- 
ing the most suitable storage order of variables is called the offset assignment 
problem and has been partially addressed by Bartley [1], Liao et al. mm, and
others (e.g., M). Basically, these solutions first determine a suitable storage
order for variables and then assign address registers to these variables to minimize
the number of address register loads. In essence, since we are determining
the contents of the address register(s) before each variable access, this problem
can also be defined as the address register assignment problem.

A major limitation of the techniques proposed so far for the address register
assignment problem is that they either focus only on modifying the storage
order of variables (e.g., cnini) or only on modifying the intra-statement access
pattern using commutativity and associativity transformations (e.g., [13]). In
this work, we present a framework that considers both computation-based (intra- 
statement and inter-statement) transformations and storage-based optimizations 
in a unified setting for “reducing the code size of a given application;” that is, 
our main objective is to save the code space. More specifically, this work makes 
the following contributions. 

(1) It presents an algorithm based on access pattern modification that makes
efficient use of post-increment/decrement addressing modes in DSPs. This algorithm
assumes a fixed storage order for variables and restructures the code
to exploit these addressing modes. This algorithm is more general than the one
proposed in m as it considers both intra-statement and inter-statement transformations.

(2) It gives an algorithm that modifies an access pattern (access sequences), 
given a partially-fixed storage order. A partially-fixed storage order is a storage
order in which the memory locations of only a subset of the variables are fixed. 

(3) It combines these two algorithms with the storage order-based optimization
strategy (i.e., offset assignment) developed by Liao et al. [11], and presents
a unified approach (which is demonstrated to be superior) to handle the offset 
assignment problem for a given control flow graph. 

2 Review of Offset Assignment 

The offset assignment problem m is one of assigning a frame-relative offset 
(i.e., storage location) to each variable in the code in order to minimize the 
number of address arithmetic instructions (that is, the instructions that load a 
new value to the address register) required to execute the code. The cost of an 
offset assignment is defined as the number of such instructions. 

Given a code sequence, we can define a unique access sequence for it. In an 
operation a = b op c, where ‘op’ is some binary operator, the access sequence 
is given by b, c, a. The access sequence for an ordered set of operations is 
simply the concatenated access sequences for each operation taken in order. For 
example, for the code fragment 

a = c + d
d = d + c + b + c + a

the access sequence is c, d, a, d, c, b, c, a, d, assuming that addition
is left-associative. Let us assume that the variables in this code fragment are
stored in memory in the following order: a, b, c, d. The cost of a given storage
sequence (offset assignment) is the number of consecutive accesses (in the access 
sequence) for which the accessed variables are not assigned to adjacent locations 
in memory. Therefore, the cost of the offset assignment given above is four as 
there are four transitions in the access sequence between non-adjacent variables.
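
The cost definition can be checked mechanically. The Python sketch below is a minimal illustration; the names `access_sequence` and `assignment_cost` and the tuple encoding of statements are assumptions introduced here rather than notation from the text. Each statement is given as its left-hand side plus its right-hand side operands in evaluation order.

```python
from typing import List, Sequence, Tuple

# A statement is encoded as (lhs, [rhs operands in evaluation order]).
Statement = Tuple[str, List[str]]

def access_sequence(stmts: Sequence[Statement]) -> List[str]:
    """Concatenate per-statement access sequences: RHS operands, then the LHS."""
    seq: List[str] = []
    for lhs, rhs in stmts:
        seq.extend(rhs)
        seq.append(lhs)
    return seq

def assignment_cost(seq: Sequence[str], storage: Sequence[str]) -> int:
    """Count consecutive accesses whose variables are not in adjacent locations."""
    offset = {var: i for i, var in enumerate(storage)}
    return sum(1 for x, y in zip(seq, seq[1:])
               if abs(offset[x] - offset[y]) > 1)

# a = c + d ; d = d + c + b + c + a
code = [("a", ["c", "d"]), ("d", ["d", "c", "b", "c", "a"])]
seq = access_sequence(code)
print(seq)                                        # c d a d c b c a d
print(assignment_cost(seq, ["a", "b", "c", "d"]))  # 4
```

The four costly transitions are d to a, a to d, c to a, and the final a to d.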






M. Kandemir et al. 



The objective of the offset assignment problem is to determine a storage order 
for variables such that the cost will be minimum. Liao [10] showed that the
offset assignment problem is equivalent to the Maximum Weighted Path Cover 
(MWPC) problem and proved that it is NP-complete. His heuristic solution 
was later improved by Leupers and Marwedel who presented a tie-breaking 
strategy for achieving better storage assignments. 
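
To make the path-cover view concrete, the following Python sketch shows one simple greedy heuristic of this general kind. It is not claimed to be Liao's exact algorithm, nor the tie-breaking strategy of Leupers and Marwedel: the function names, the alphabetical tie-breaking, and the way disjoint paths are concatenated are all assumptions made for illustration. Edge weights count how often two variables are accessed consecutively; heavy edges are kept as long as every variable has at most two neighbors and no cycle is closed.

```python
from collections import Counter
from typing import Dict, List, Sequence

def greedy_storage_order(seq: Sequence[str]) -> List[str]:
    """Greedy path cover on the access graph, then lay the paths out in memory."""
    weight: Counter = Counter()
    for x, y in zip(seq, seq[1:]):
        if x != y:
            weight[frozenset((x, y))] += 1

    variables = sorted(set(seq))
    neighbors: Dict[str, List[str]] = {v: [] for v in variables}
    component = {v: v for v in variables}  # path id of each variable

    # Heaviest edges first; ties broken alphabetically (an assumption).
    for edge, _w in sorted(weight.items(),
                           key=lambda kv: (-kv[1], sorted(kv[0]))):
        x, y = sorted(edge)
        if (len(neighbors[x]) < 2 and len(neighbors[y]) < 2
                and component[x] != component[y]):
            neighbors[x].append(y)
            neighbors[y].append(x)
            old, new = component[y], component[x]
            for v in variables:
                if component[v] == old:
                    component[v] = new

    # Walk every path from one of its endpoints and concatenate the paths.
    order: List[str] = []
    placed = set()
    for start in variables:
        if start in placed or len(neighbors[start]) == 2:
            continue
        prev, cur = None, start
        while cur is not None:
            order.append(cur)
            placed.add(cur)
            nxt = [n for n in neighbors[cur] if n != prev and n not in placed]
            prev, cur = cur, (nxt[0] if nxt else None)
    return order

def cost(seq: Sequence[str], storage: Sequence[str]) -> int:
    """Number of transitions between variables in non-adjacent locations."""
    offset = {v: i for i, v in enumerate(storage)}
    return sum(1 for x, y in zip(seq, seq[1:])
               if abs(offset[x] - offset[y]) > 1)

seq = list("cdadcbcad")
order = greedy_storage_order(seq)
print(order, cost(seq, order))  # ['a', 'd', 'c', 'b'] 1
```

For the access sequence of the earlier code fragment, this sketch picks the storage order a, d, c, b, which lowers the cost from four to one.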

3 Computation Restructuring for a Fully Fixed Storage 
Sequence 

Code size reduction using address register assignment is achieved by making the 
access sequence (i.e., the order in which the variables are accessed) and the stor- 
age sequence (i.e., the storage order of the variables in memory) compatible. In 
practice, it is possible to do either of the following: modify the access sequence 
for a fixed storage sequence, or modify the storage sequence for a given fixed 
access sequence. In this section, we discuss a strategy that adopts the former 
approach as opposed to Liao’s scheme [10] which takes the latter approach. In
this work, we apply code transformations to a high-level intermediate represen- 
tation (IR) of the code where optimizations such as conventional (e.g., graph 
coloring-based) register allocation and common subexpression elimination have 
already been performed. This IR has statements very similar to high-level source 
statements. In the remainder of this presentation, when we mention statement, 
we actually refer to this IR-level statement. However, to make the presentation
clear, we use source-level (C-like) statements. Consider a statement of the
following form

a = b + c 

Let us assume that the machine has a single address register and that the storage
sequence is c, b, a. The access sequence in this example is b, c, a, which is
different from the storage sequence. As a result of this, going from variable
c to variable a incurs an explicit address register load (since c and a are not
consecutive in the storage sequence, we cannot use post-increment/decrement
mode). Liao’s approach [10] fixes this problem by modifying the storage sequence
from c, b, a to b, c, a. Changing the storage sequence is a viable option
provided that the variables have not yet been assigned to storage locations, 
or (if they have already been assigned to locations) the cost of transforming 
the storage sequence from one form to another (which may require copying 
resulting in additional memory requirements) does not outweigh its benefits. An 
access pattern-oriented approach, on the other hand, can optimize this code by 
transforming this statement into 

a = c + b 

The new access sequence is c, b, a, which is the same as the storage sequence.
Note that, for this example, just applying a commutativity transformation
(an intra-statement transformation) was sufficient to obtain the desired result.

Let us consider the following code fragment with two statements.

a = c + e
b = c + f

We assume a single address register and a storage sequence of a, b, c, d, e, f.
It should be noted that each variable access in this code fragment (under the
assumed storage sequence) will require a load to the address register. A storage
layout-oriented scheme would change the storage sequence of the variables, but
this may be too costly if the variables have already been assigned to storage
locations (for example, during the optimization of a different set of statements
that manipulate the same variables). On the other hand, a commutativity
transformation would lead to

a = c + e 
b = f + c 

Note that this code fragment (which is obtained from the previous one by ap- 
plying commutativity transformation to the right-hand side of the second as- 
signment statement) eliminates one of the explicit loads to the address register. 
That is, in going from c to b in the second assignment statement, we can make 
use of the post-decrement mode (as these two variables are consecutive in mem- 
ory). An inter-statement transformation, on the other hand, can generate the 
following program fragment 

b = f + c 
a = c + e 

Note that this code fragment is obtained from the original one by interchanging 
the order of two statements and by applying commutativity transformation to 
one of the statements. In this case, two variable accesses (i.e., going from c to 
b in the first statement, and going from b in the first statement to c in the 
second statement) can be satisfied using post-increment/decrement modes. This
is a simple example that illustrates the benefit of inter-statement optimization. 
However, there are some cases where it is not possible to interchange the order 
of statements due to data dependency constraints. For example, in the code 
fragment 



a = a + c 
c = c + 1 

interchanging the two statements would give a wrong result as the value used for
c in a = a + c would be different from the one in the original case. Here, a
storage-oriented approach (e.g., m), on the other hand, could store a and c 
in consecutive locations in memory, thereby leading to the effective use of post- 
increment and decrement addressing modes. 
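
Whether two adjacent statements may be interchanged can be decided with the usual dependence test. The Python sketch below is a simplified illustration: the name `can_interchange` and the tuple encoding are assumptions, and only scalar flow, anti, and output dependences are considered.

```python
from typing import List, Tuple

Statement = Tuple[str, List[str]]  # (variable written, variables read)

def can_interchange(first: Statement, second: Statement) -> bool:
    """Two adjacent statements may be swapped only if no flow, anti, or
    output dependence links them."""
    w1, r1 = first
    w2, r2 = second
    flow = w1 in r2    # second reads what first writes
    anti = w2 in r1    # second overwrites what first reads
    output = w1 == w2  # both write the same variable
    return not (flow or anti or output)

# a = c + e ; b = c + f  (no dependence, so the swap is legal)
print(can_interchange(("a", ["c", "e"]), ("b", ["c", "f"])))  # True
# a = a + c ; c = c + 1  (anti dependence on c, so the swap is illegal)
print(can_interchange(("a", ["a", "c"]), ("c", ["c"])))       # False
```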

The preceding examples show that neither storage-based techniques nor access
sequence (computation) based techniques (intra- and inter-statement transformations)
dominates the other, and a unified framework that uses both
techniques may be needed for better results. In the rest of this section, we formulate
the computation-oriented transformations using a graph-based representation.

3.1 Terminology

We represent a program using a control flow graph (CFG) which is a directed 
graph in which each node denotes a basic block and an edge between two basic 
blocks indicates that there is a possibility that the flow of control (during execution)
may be transferred from one of these basic blocks to the other. A basic
block can be defined informally as a straight-line sequence of statements that
can be entered only at the beginning and exited only at the end [16].

Consider a graph G = (V, E) where V is the set of nodes (vertices) and E is
the set of edges. A path cover (or cover) C of a given graph G(V, E) is a set of
paths such that every node in V is incident to some edge belonging to the chosen
set of paths. In other words, we can think of a cover G(V', E') as a subgraph of
G(V, E) where V' = V and E' ⊆ E. The length of a path is the number of edges
in the path, and the length of a cover is the sum of the number of edges of each 
constituent path. A path that has the maximum length (among all paths in the 
cover) is referred to as the longest path. 



3.2 Layout Transition Graph 

Given a basic block, we use a layout transition graph (LTG) to show the connections
between elements that are stored consecutively in memory. The layout
transition graph of a basic block is a directed graph LTG(V, E), where each
node v_i represents a variable that occurs in the basic block, and a directed edge
e = (v_i, v_j) from a node v_i to a node v_j indicates that the variable represented
by v_i is stored (in memory) next to the variable represented by v_j. Whether v_i
comes before v_j in the storage order or after v_j is not important for the purposes
of this work (as long as they are consecutive in memory). An LTG also contains
an edge from v_i to v_j if these two nodes represent occurrences of the same
variable. Note that the variable access pattern of a program touches all the nodes
of the corresponding LTG.

For ease of exposition, we divide a given LTG into layers, each layer corresponding
to a statement in the basic block. If the basic block contains K
statements, each variable v_i in the jth statement from the top (denoted s_j, where
1 ≤ j ≤ K) is assumed to belong to the variable set of s_j; we express this as
v_i ∈ s_j. We will use s_j to denote both the statement and its variable set, where
there is no confusion.

A given variable set s_i can also be divided into two logical subsets: one that
contains the variable on the left hand side (LHS), and one that contains the
variables on the right hand side (RHS). For a variable set s_i, the first subset is
denoted by s_iL and the second subset is denoted by s_iR.

To illustrate these concepts, consider the LTG shown in Figure 1(i) for the
statement a = b + c, assuming that the storage sequence is c, b, a. There
is a bi-directional edge between c and b (i.e., we have a directed edge from c
to b and one from b to c), and another bi-directional edge between b and a.
Labeling this statement by s_1, we have s_1L = {a} and s_1R = {b, c}. Note
that the access sequence for this statement is b, c, a as shown in Figure 1(iii)
using dashed arrows. It should also be noted that a new access sequence can
be obtained by traversing the edges in the LTG in a different manner. If we
start from the variable c, we can first traverse the edge (c,b) and then the edge
(b,a), as depicted in Figure 1(iv). Note that this new traversal corresponds to
transforming the statement from a = b + c to a = c + b (i.e., a commutativity
transformation).

We need to emphasize that it may not always be possible to transform a statement
based on its LTG. Further, not every traversal of the edges in the LTG is
legal. For example, going from a to b using the edge (a,b) is not acceptable (see
Figure 1(v)) as all the right-hand side references should be accessed before the
left-hand side reference. We can prevent some of the transitions such as this by
eliminating edges from the LTG that would lead to unacceptable or infeasible
transformations. For example, in order to prevent a transition from a to b,
we eliminate the directed edge from a to b as shown in Figure 1(ii). Obviously,
given the two legal traversals in Figures 1(iii) and (iv), we prefer the one in
Figure 1(iv) as all transitions between variables in this figure are between consecutive
memory locations, meaning that we can use post-increment/decrement
mode for these transitions. Another way of expressing this is that both the
edges visited during the traversal in Figure 1(iv) belong to the LTG given in
Figure 1(ii). On the other hand, one of the transitions taken during the traversal
in Figure 1(iii) (the transition from c to a) does not have any corresponding
edge in the LTG. Therefore, the objective of a traversal must be minimizing the
number of transitions that do not correspond to an edge in the LTG. We will
formalize this concept later.

Now, let us consider the LTG given in Figure 1(vi) for the following program
fragment. 



a = c + e 
b = c + f 

It is assumed here that the storage sequence is a, b, c, d, e, f. As before,
a traversal of the nodes of this LTG corresponds to a specific access sequence.
The default access sequence is c, e, a, c, f, b as shown in Figure 1(viii).
Note that a different traversal of the nodes corresponds to a transformation
of the code sequence. Here, an important point should be noted. In traversing
the nodes (or edges), we have a restriction in the sense that once we are in a
statement we need to finish all the nodes in the statement before moving to a
node in another statement. That is, we are not allowed to go from a node in s_kR
to a node in s_k'R if k ≠ k', assuming that each statement has a left hand side
variable.

The preceding discussion indicates that we need some restrictions on the
traversal order of the nodes in the LTG. For this purpose, we use a modified form
of the LTG called constrained layout transition graph (CLTG), and perform our
traversal on this graph. Simply, in those cases where the compiler can detect that
variable v_i in statement s_k cannot be accessed immediately after the variable v_j
in statement s_k' (s_k and s_k' are not necessarily distinct here), the corresponding
edge (if any) from v_j to v_i in the LTG should be removed when constructing
the CLTG. (Instead of deleting edges from the LTG to construct the CLTG,
it is possible to directly construct the CLTG using the necessary edges, albeit
using somewhat more complicated rules. The correctness of the algorithms is not
affected by the choice of either method to construct the CLTG.)

A constrained layout transition graph, written CLTG(V', E'), is a subgraph
of the LTG(V, E) such that V' = V and E' contains all the edges in E except
those that can lead to an incorrect or infeasible code transformation. The
construction of the CLTG subsumes both the intra-statement constraints (i.e.,
evaluation rules that need to be obeyed when processing an RHS expression)
and the inter-statement constraints (i.e., dependence and other constraints between
statements). For example, a CLTG cannot contain an edge between the
variable occurrences of the right hand sides of two different assignment statements.
In mathematical terms, an edge e = (v_i, v_j) ∈ E does not belong to
E' if v_i ∈ s_kR and v_j ∈ s_k'R, where k ≠ k'. Figure 1(vii) depicts the CLTG
for the LTG in Figure 1(vi). Note that the default traversal (access sequence)
given in Figure 1(viii) does not use any of the edges in the underlying CLTG.
Consequently, an explicit address register load is necessary prior to each variable
access. Now consider the traversal given in Figure 1(ix). In this case, the new
access sequence corresponds to a transformation in which the right hand side
of the second statement is transformed using commutativity. Note that one of
the transitions in this traversal (i.e., the one from c to b) has a corresponding
edge in the CLTG given in Figure 1(vii). Finally, let us focus on the traversal
given in Figure 1(x). The transformation corresponding to this traversal is one of
interchanging the order of the two statements and applying the commutativity
transformation to one of the statements. In this traversal, two transitions, one
going from c to b and the other going from b to c, have corresponding edges in
the CLTG. These two examples in Figure 1 show that the preferred traversal
must maximize the number of transitions that have corresponding edges in the
underlying CLTG. In other words, it should minimize the number of transitions
that do not have corresponding edges in the CLTG.

It should be noted, however, that although a given CLTG shows possible 
legal transitions between nodes, it is still possible to generate an illegal traversal 
(access sequence) on the CLTG. For example, by itself, accessing two nodes v_i
and v_j consecutively may not break any dependence; however, after this modified
access sequence, it may not be possible to generate legal code due to a new
restriction (in the access order) resulting from the said transition between v_i
and v_j.
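
The edge rules described in this subsection can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the construction used by the authors: the `Occ` record, the function names, and the `must_precede` encoding of dependences are introduced here, and only the three rules named above are modelled (layout adjacency, RHS before LHS within a statement, and no RHS to RHS edges across statements).

```python
from typing import AbstractSet, NamedTuple, Sequence, Tuple

class Occ(NamedTuple):
    stmt: int  # statement index in the original basic block
    side: str  # "R" for a right-hand side operand, "L" for the left-hand side
    var: str

def cltg_edge(u: Occ, v: Occ, storage: Sequence[str],
              must_precede: AbstractSet[Tuple[int, int]] = frozenset()) -> bool:
    """True if the directed edge u -> v survives in the sketched CLTG."""
    offset = {name: i for i, name in enumerate(storage)}
    if abs(offset[u.var] - offset[v.var]) > 1:
        return False  # neither adjacent in memory nor the same variable
    if u.stmt == v.stmt:
        return not (u.side == "L" and v.side == "R")  # RHS before LHS
    if u.side == "R" and v.side == "R":
        return False  # no RHS-to-RHS edge between different statements
    # (x, y) in must_precede means statement x has to execute before y.
    return (v.stmt, u.stmt) not in must_precede

def edges_used(traversal: Sequence[Occ], storage: Sequence[str]) -> int:
    """Number of transitions in the traversal that follow a CLTG edge."""
    return sum(cltg_edge(u, v, storage)
               for u, v in zip(traversal, traversal[1:]))

# Statement 1: a = c + e.  Statement 2: b = c + f.
c1, e1, a1 = Occ(1, "R", "c"), Occ(1, "R", "e"), Occ(1, "L", "a")
c2, f2, b2 = Occ(2, "R", "c"), Occ(2, "R", "f"), Occ(2, "L", "b")
storage = "abcdef"

print(edges_used([c1, e1, a1, c2, f2, b2], storage))  # 0: default order
print(edges_used([c1, e1, a1, f2, c2, b2], storage))  # 1: b = f + c
print(edges_used([f2, c2, b2, c1, e1, a1], storage))  # 2: statements swapped
```

The three counts match the traversals discussed for the two-statement fragment: no edge is used by the default order, one after the commutativity transformation, and two after the statements are also interchanged.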



3.3 Traversing the CLTG 



We formulate the problem of modifying a given basic block code for effective 
use of the address register(s) as one of determining a path cover and a traversal 
order in the CLTG. We assume for now that the AGU has only a single address 
register.

Fig. 1. (i-v) LTG, CLTG, and different traversals for an assignment statement under
the storage sequence c, b, a. (vi-x) LTG, CLTG, and different traversals for a program
fragment under the storage sequence a, b, c, d, e, f.

Legality. In order to generate correct code (that is, to preserve the original
semantics of the basic block), we impose the following conditions on the traversal
order:

(1) Each node in the LTG (i.e., a variable occurrence in the basic block) 
should be visited. 

(2) For a given layer in the LTG corresponding to the statement s_k, all nodes
in s_kR should be visited before any node in s_kL.

(3) Once the traversal reaches the layer corresponding to the statement s_k, it
should finish all the variables in that layer (i.e., the set s_kL ∪ s_kR) before moving
to another layer.

(4) All the data dependences and other restrictions such as latency con- 
straints or expression evaluation constraints should be observed. 
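
The four conditions can be turned into a simple checker. The Python sketch below is a hedged illustration: the name `is_legal`, the tuple encoding of variable occurrences, and the `must_precede` set are assumptions, and Condition (4) is reduced to data dependences between statements (latency and expression evaluation constraints are not modelled). Every statement mentioned in `must_precede` is assumed to occur in the basic block.

```python
from collections import Counter
from typing import AbstractSet, List, Sequence, Tuple

Occ = Tuple[int, str, str]  # (statement id, "R" or "L", variable name)

def is_legal(traversal: Sequence[Occ], block: Sequence[Occ],
             must_precede: AbstractSet[Tuple[int, int]] = frozenset()) -> bool:
    # (1) every variable occurrence of the basic block is visited exactly once
    if Counter(traversal) != Counter(block):
        return False
    order: List[int] = []  # statements in the order the traversal enters them
    seen_lhs = False
    for stmt, side, _var in traversal:
        if not order or order[-1] != stmt:
            if stmt in order:  # (3) came back to a layer that was left earlier
                return False
            order.append(stmt)
            seen_lhs = False
        if side == "L":
            seen_lhs = True
        elif seen_lhs:  # (2) an RHS operand visited after the LHS
            return False
    # (4) only data dependences between statements are modelled here
    position = {stmt: i for i, stmt in enumerate(order)}
    return all(position[a] < position[b] for a, b in must_precede)

# a = a + c ; c = c + 1, where statement 1 must stay before statement 2
block = [(1, "R", "a"), (1, "R", "c"), (1, "L", "a"),
         (2, "R", "c"), (2, "L", "c")]
deps = {(1, 2)}

print(is_legal(block, block, deps))                   # True
print(is_legal(block[3:] + block[:3], block, deps))   # False: order swapped
print(is_legal([block[2], block[0], block[1], block[3], block[4]],
               block, deps))                          # False: LHS before RHS
```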

Condition (1) indicates that each variable should be touched (by any legal
execution of the code). We enforce Condition (4) by ensuring that we do not make
a transition from a v_i ∈ s_k to a v_j ∈ s_k' (even if v_i and v_j are consecutive in
memory) when there is a data dependence from s_k' to s_k. To enforce Condition
(2), we do not allow a transition from the node v_i ∈ s_kL to a node v_j ∈ s_kR. To
enforce Condition (3), we disallow transitions between node v_i ∈ s_kR and any
node v_j ∈ s_k'R for k ≠ k'. A transition from a node v_i ∈ s_kL to a node v_j ∈ s_k'L
(where k ≠ k') is allowed only if s_k' has no variables on the right hand side (i.e.,
s_k'R = ∅). Also, there cannot be a transition from a node v_i ∈ s_kR to a node
v_j ∈ s_k'L (where k ≠ k') unless s_k' has no variable on the right hand side (i.e.,
s_k'R = ∅) and s_k has no LHS variable, which cannot occur in our framework.



Fig. 2. (i) LTG and (ii) CLTG for a given basic block. (iii) Default access sequence.
(iv) Optimized access sequence. (v) Example paths in the CLTG.



Profitability. The objective of the traversal of the nodes in the CLTG is to
minimize the cost of the traversal, which is defined as the number of transitions
from a node v_i to a node v_j such that v_i and v_j are not consecutive in the storage
sequence (i.e., there is no edge (v_i, v_j) in the CLTG) for all i and j. It should be
noted that a storage sequence imposes constraints on the CLTG. If a transition
from v_i to v_j does not use an edge in the CLTG, this means that a post-increment
or a post-decrement cannot be used for this transition; thus, a new value should
be loaded in the address register (using an explicit load instruction), thereby
increasing the code size. As a result, the cost of a traversal can be viewed as
the number of transitions in the access sequence that do not use an edge in the
CLTG. Thus, the address register assignment problem can be re-expressed as

determining a traversal of the nodes in the CLTG, subject to the four
legality conditions listed above, that minimizes the number of transitions
that do not correspond to an edge in the CLTG.

It can be shown that this problem is NP-complete; we omit the proof due
to lack of space.

Let us now concentrate on the larger basic block given below assuming a 
storage sequence of a, b, c, d, e, f. 

c = a + b 
f = d - e - 2 
a = a + 3d 
c = 2f + 4 
d = d + f + a 

Figures 2(i) and (ii) show the LTG and CLTG, respectively, for this code
fragment under the assumed storage sequence. Note that, in going from the
LTG to the CLTG, many edges are dropped as they are not possible for any
legal traversal. Figure 2(iii) shows the default access sequence (i.e., without any
optimization). This access sequence has a cost of eight, and the transitions that
contribute to this cost are marked using the symbol ‘*’. Our approach, on the
other hand, results in the access sequence (traversal) given in Figure 2(iv). We see
that the cost of this access sequence is four (again, the transitions that contribute
to the cost are marked using the symbol ‘*’). In other words, we are able to
eliminate four address register loads in the code. This traversal corresponds to
the following transformed program:

c = a + b 
f = d - e - 2 
c = 2f + 4 
a = 3d + a 
d = a + f + d 

Note that this optimized code is obtained from the original one through one 
statement reordering (inter-statement transformation) and a number of intra- 
statement transformations. 
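
The two costs quoted for this basic block can be reproduced directly from the statement text. The Python sketch below is illustrative only: the helper names `accesses` and `traversal_cost` are assumptions, variables are assumed to be single lowercase letters, and constants and numeric coefficients (such as the 3 in 3d) are assumed not to be memory accesses.

```python
import re
from typing import List, Sequence

def accesses(stmt: str) -> List[str]:
    """Variables touched by `lhs = rhs`: RHS operands left to right, then the LHS."""
    lhs, rhs = stmt.split("=")
    return re.findall(r"[a-z]", rhs) + [lhs.strip()]

def traversal_cost(stmts: Sequence[str], storage: str) -> int:
    """Transitions between variables that are not adjacent in `storage`."""
    seq = [v for s in stmts for v in accesses(s)]
    return sum(1 for x, y in zip(seq, seq[1:])
               if abs(storage.index(x) - storage.index(y)) > 1)

original = ["c = a + b", "f = d - e - 2", "a = a + 3d",
            "c = 2f + 4", "d = d + f + a"]
optimized = ["c = a + b", "f = d - e - 2", "c = 2f + 4",
             "a = 3d + a", "d = a + f + d"]

print(traversal_cost(original, "abcdef"))   # 8
print(traversal_cost(optimized, "abcdef"))  # 4
```

Transitions between two occurrences of the same variable (for example f to f across the second and third statements of the optimized code) are free, which is why they do not appear in the count.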



The Algorithm and Transformations. We now present an algorithm that
takes as input a CLTG and generates as output a traversal (an access sequence)
and all the necessary (inter-statement and intra-statement) transformations to
obtain this access sequence. Given a CLTG, the algorithm first detects the
longest directed path (i.e., the path that contains the maximum number of edges
in the same direction). (Note that the longest path detection problem is a hard
problem in general; here, we are employing a heuristic.) It then transforms the
portion of the CLTG (which contains a subset of the statements in the original
basic block) in accordance with this longest path. Finding the longest path in a
given directed graph is straightforward, and takes O(N^) time, where N is the
number of nodes in the graph. Transforming the program code in accordance with
the longest path is more challenging. Consider the abstract CLTG in Figure 3 and
the longest path shown. Note that each layer in the CLTG is labeled with a
different statement id. The desired access sequence here is a, c, h, d, f, g, b, e.
To achieve this access sequence, the following transformations need to be performed:

(1) The variable a should be made the last variable accessed on the RHS of
the statement s_1;

(2) In statement s_2: (i) the variable h should be made the first variable
accessed on the RHS; (ii) the variable h should be made to immediately precede
the variable d;

(3) Statement s_4 should be made to immediately follow the statement s_2;
and

(4) In statement s_4: (i) the access of variable b should be made to immediately
follow the variable g; (ii) the variable e should be made to immediately
follow the variable b.

In addition to these transformations, the transformed program should not modify 
the following properties of the input code (CLTG): 

(1’) Statement s_2 immediately follows statement s_1.

(2’) d is the last variable accessed on the RHS of statement s_2.

(3’) g is the first variable accessed on the RHS in statement s_4.

If the compiler can find a series of transformations to satisfy all these constraints,
we achieve the best possible access sequence (for this path). In many
cases, however, this may not be possible due to inconsistencies between the requirements
given above, or due to a situation that does not involve the variables
on the longest path. An example of the former is the inconsistency between conditions
(2.i), (2’), and (2.ii) above. That is, if we make the variable h the first
variable on the RHS of the statement s_2 and insist on keeping the variable d as
the last variable on the RHS, it is not possible to access h and d successively as
there are two more variables on the RHS. We assume that these other variables
are different from those labeled in the figure. An example of the second type
of difficulty is the possibility that it may not be legal to execute the statement
s_4 immediately after the statement s_2 (as required by condition (3)). This
may occur, for example, if the statement s_3 writes a variable x (assumed to be a
different variable from the ones shown in the figure) that is subsequently read
by the statement s_4. Although it may not always be possible to achieve all of
the desired transformations, our approach attempts to achieve as many of the
desired transformations as possible. Note that this strategy helps to use as many
edges in the CLTG as possible.

Fig. 3. An abstract CLTG and the longest path.

After the longest path has been determined and the portion of the CLTG
that contains the longest path (that is, a subset of the statements in the original
basic block) has been transformed, our approach continues by selecting the
second longest path and transforming the relevant parts of the CLTG. Special
attention is paid to ensure that we do not modify any parts of the basic block
that have already been transformed in accordance with a longer path considered
earlier. In this way, our approach selects the next longest path in each step and
transforms the relevant portions of the basic block. The process stops when it
is not possible to transform the basic block any further (without distorting the
previous transformations). In case we have two paths of the same length, the
current implementation favors the one that leads to minimal modification to the
original code.
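
The select-the-next-longest-path loop can be sketched as follows. This Python fragment is only a rough illustration under several assumptions: it uses an exhaustive search (exponential in the worst case, so suitable for small graphs only) rather than the heuristic mentioned in the text, it treats previously chosen paths as untouchable by removing their nodes, and it does not model the feasibility checks on statement transformations. The names `longest_path` and `peel_paths` are hypothetical.

```python
from typing import Dict, List, Set

Graph = Dict[str, Set[str]]  # node -> set of successor nodes

def longest_path(graph: Graph, allowed: Set[str]) -> List[str]:
    """Exhaustive search for a longest simple directed path over `allowed` nodes."""
    best: List[str] = []

    def extend(path: List[str], seen: Set[str]) -> None:
        nonlocal best
        if len(path) > len(best):
            best = list(path)
        for nxt in sorted(graph.get(path[-1], ())):
            if nxt in allowed and nxt not in seen:
                path.append(nxt)
                seen.add(nxt)
                extend(path, seen)
                path.pop()
                seen.remove(nxt)

    for start in sorted(allowed):
        extend([start], {start})
    return best

def peel_paths(graph: Graph) -> List[List[str]]:
    """Repeatedly take a longest path among the nodes not yet covered."""
    remaining = set(graph) | {v for succ in graph.values() for v in succ}
    paths: List[List[str]] = []
    while remaining:
        path = longest_path(graph, remaining)
        paths.append(path)
        remaining -= set(path)
    return paths

# A small made-up graph: p -> q -> r -> s, plus a side edge t -> r.
example = {"p": {"q"}, "q": {"r"}, "r": {"s"}, "t": {"r"}}
print(peel_paths(example))  # [['p', 'q', 'r', 's'], ['t']]
```

In the sketch, ties between paths of equal length are resolved by alphabetical order of the start node, whereas the text states that the actual implementation prefers the path requiring the least modification to the original code.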

In the example in Figure following the construction of the CLTG shown in 
Figure[2](ii), our approach determines the longest path marked as the path in 
Figure [^v). Based on this path, it builds an access sub-sequence a, b, c, d, e, f , 
f . This sub-sequence completely specifies the transformations required for three 
of the five statements in the code (i.e., the first, second, and fourth statements 
in the original code). Note also that the transformations performed along this 
path include an inter- statement transformation. Next, it finds the path a, a, a 
(marked as the 2 ^^ path). Note that this path fixes the access sequence for the 
third statement in the original code completely as d, a, a. It also specifies that 
the variable a should be the first variable accessed in fifth statement. After that, 
the approach selects the path c, d, d. The (c,d) part of this path says that the 
fifth statement should follow the fourth statement in the transformed program, 
but this is not possible as the fourth statement has already been transformed, 
and it now (in the transformed code) comes before the third statement (in the 
original program). The (d,d) part of the path, on the other hand, is feasible, and 
indicates that d should be the last variable accessed in the fifth statement. The 
next path is c, d; but, the transformation implied by this is not possible. The 
last path is the one between c and d (marked as the 5^^ path in the figure). 
It implies that d should be the first variable accessed in the third statement, 
and the third and fourth statement should be interchanged. At this point, the 
algorithm has traversed all the paths. It next visits each statement, and fixes 
the access order for the variable whose order has not been fixed yet. It visits the 
fifth statement (in the original code) and makes f the second variable accessed 
on the RHS. The final access sequence is shown in Figure 2(iv). 

4 Computation Restructuring: Partially Fixed Storage Sequence Case 

So far, we have assumed that the storage sequence (storage pattern) of variables 
is fixed completely. That is, a storage location is assigned to each program vari- 
able. In this section, we describe how to optimize an access sequence when only 
a subset of the variables have fixed memory locations. This is called the partially 
fixed storage. Specifically, given a partially fixed storage pattern of a basic block, 
we address two subproblems: 

(1) Determining the best access sequence for all variables in the basic block, and 

(2) Determining the storage sequence for the variables in the basic block 
whose memory locations are yet to be determined. 

This problem is important because the compiler employs it during procedure- 
wide optimization (as will be discussed in the next section). Our approach to 
the problem involves the following three steps: 

(1) Determine the best access (possibly partial) pattern for the partial storage 
order given, 

(2) Determine the storage sequence for the variables whose memory locations 
are yet to be determined, and 

(3) If there is further flexibility, then determine the best access pattern for 
the portions of the basic block that involve the variables whose storage sequence 
was determined in Step (2). 

Consider the following program fragment, assuming a single address register 
and a partially fixed storage sequence of e, b, d. 

e = e + d 
a = d + c 
f = 3c + b 
a = (a * c) + (a * g) 

Figure 4(i) shows the CLTG for this basic block, under the given partial storage 
sequence. Clearly, there is just one path in this case. Transforming the code in 
accordance with this path gives us: 

e = d + e 
f = b + 3c 
a = d + c 
a = (a * c) + (a * g) 

Note that this transformation (which corresponds to Step (1) above) involves 
one statement interchange and one commutativity transformation. In the next 
step (which is Step (2) above), the compiler attempts to determine a storage 
sequence for the variables whose storage locations are yet to be determined. We 
achieve this using a modified version of Liao’s heuristic. Liao summarizes the 
access sequence using a graph called the access graph. In this graph, each variable 
is represented by a node and a weighted edge between two variables corresponds 
to the number of transitions between them. Liao then runs an algorithm on this 
graph to select a path cover, with no node having more than two selected edges 
incident on it. 

The variables represented by the nodes connected by a selected edge are 
assigned to consecutive memory locations. The objective is to maximize the 
total weight of the edges selected (which corresponds to capturing the most 
frequent transitions). We modify this heuristic as follows. Let $\mathcal{L} = \{v_i\}$ be the 
set of all variables $v_i$ that have already been assigned to consecutive storage 
locations. Let us assume for now that there is only a single such set. We use $h_{\mathcal{L}}$ 
to denote the first (start) node of $\mathcal{L}$ and $t_{\mathcal{L}}$ to denote the last (terminal) node. 
Each node in the modified access graph corresponds to either a single node $v_j$ 
such that $v_j \notin \mathcal{L}$ or a block node $v_{\mathcal{L}}$ that represents $\mathcal{L}$. There exists an edge 
between $v_j$ ($\notin \mathcal{L}$) and $v_{\mathcal{L}}$ if and only if there is an edge between $v_j$ and $h_{\mathcal{L}}$ or an 
edge between $v_j$ and $t_{\mathcal{L}}$. We also keep track of whether the edge between $v_j$ and 
$v_{\mathcal{L}}$ is due to (incident on) $h_{\mathcal{L}}$ or $t_{\mathcal{L}}$. 
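
The construction of the modified access graph can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation used in this work: the function name modified_access_graph, the node label "<block>", and the choice of summing transition counts into the weight of a block edge are hypothetical. The text above only states when a block edge exists and that its origin (start or terminal node) is recorded.

```python
from collections import Counter

BLOCK = "<block>"  # hypothetical label for the block node


def modified_access_graph(sequence, fixed_block):
    """Collapse an already placed run of variables into one block node.

    sequence    -- variable access sequence of the basic block
    fixed_block -- variables already assigned to consecutive locations,
                   listed in storage order
    Returns (weights, incident): edge weights keyed by frozenset, and for
    each free variable whether its block edge is due to the head or tail.
    """
    head, tail = fixed_block[0], fixed_block[-1]
    placed = set(fixed_block)
    weights = Counter()
    incident = {}
    for a, b in zip(sequence, sequence[1:]):
        # Self transitions and transitions inside the block need no edge.
        if a == b or (a in placed and b in placed):
            continue
        if a not in placed and b not in placed:
            weights[frozenset((a, b))] += 1
            continue
        free, fixed = (b, a) if a in placed else (a, b)
        # Only the two ends of the block can still get a new neighbor.
        if fixed in (head, tail):
            weights[frozenset((free, BLOCK))] += 1
            side = "head" if fixed == head else "tail"
            incident.setdefault(free, set()).add(side)
    return weights, incident
```

Transitions to variables in the interior of the block are dropped in this sketch, since no free variable can be placed next to them anymore.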

Figure 4(ii) shows this modified access graph for our example. Note that this 
access graph is constructed by taking into account the transformations (both 
inter-statement and intra-statement) done in the previous step. Next, we run 
Liao’s heuristic on this access graph. Figure 4(iii) shows the maximum weight 
cover detected by the heuristic. Afterwards, we determine the complete storage 
order (sequence) for the variables. In our example, this sequence is e, b, d, f, 
c, a, g. Although it does not occur in this example, in some cases, the compiler 
may have additional scope, and may apply Step (3) above to further modify 
the access pattern to accommodate the needs of the variables whose storage 
locations have been determined in Step (2). Note that although we explain this 
strategy assuming that there is a single block node ($\mathcal{L}$), it is straightforward to 
extend the approach to multiple block nodes. Note also that since our approach 
is essentially basic block oriented, we can expect its effectiveness to increase 
when it is used in conjunction with techniques that increase basic block sizes 
(e.g., superblocks/hyperblocks). 

Fig. 4. (i) An example CLTG. (ii) An access graph for partially fixed storage sequence. 
(iii) Selected maximum weight cover. 





5 Intra-procedural Optimization Strategy 

We now present a unified strategy that employs both access sequence and storage 
sequence transformations to make effective use of address registers. The approach 
works on a representation called the weighted control flow graph (WCFG), which is a 
CFG with weighted nodes (basic blocks). A node weight specifies the number of 
times the corresponding basic block is entered (dynamic execution frequency). 
This is typically calculated by considering the execution frequencies of edges and 
branch probabilities. 

Our approach to this global (procedure-wide) optimization problem is as 
follows. After determining the execution frequencies of basic blocks and labeling 
them, we visit basic blocks one-by-one, and optimize a basic block completely 
before moving to the next one. The optimization order is determined by the 
weights (i.e., basic block labels). 

The first (most frequently executed) basic block is optimized using Liao’s 
heuristic (explained in Section 2). After optimizing this basic block, we determine 
a storage sequence for all the variables accessed by this basic block. Note that this 
step determines only a partial storage sequence (called the storage subsequence), 
as the variables accessed by this block form, in general, a subset of all the 
variables declared in the program. Then, we move to the next most frequently 
executed basic block, and optimize it using the approach explained in Section 3 
or Section 4, depending on whether all the variables manipulated by this basic 
block already have fixed memory (storage) locations or not. After optimizing this 
basic block, new storage subsequences (for the variables accessed by this second 
most frequently executed basic block, but not accessed by the most frequently 
executed basic block) are determined. Afterwards, we move to the third most 
frequently executed basic block and, in optimizing it (using the techniques given 
in Section 3 and Section 4), we take into account all the storage sequences 
determined so far. In this way, our approach handles the basic blocks one by 
one, and in optimizing each of them, it considers the storage sequences found so 
far. If at a given point, the storage location for each variable in the code is fixed 
(i.e., a complete storage sequence is determined), the remaining basic blocks are 
optimized using the technique discussed in Section 3. At the end of the process, 
if the storage sequences found do not form a single connected component, they 
are made so using a post-processing pass. 
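
The order in which basic blocks are visited, and the choice of technique for each of them, can be sketched as follows. This is a simplified illustration, not the actual compiler pass: the function name plan_blocks and the input format (block name mapped to a weight and a set of variables) are assumptions, and the per-block optimizations themselves are only named, not performed.

```python
def plan_blocks(blocks):
    """Decide, per basic block, which technique applies.

    blocks -- dict: block name -> (execution weight, set of variables)
    Returns a list of (block name, technique) in optimization order.
    """
    # Visit blocks from the most to the least frequently executed.
    order = sorted(blocks, key=lambda name: -blocks[name][0])
    fixed = set()  # variables whose storage location is already decided
    plan = []
    for i, name in enumerate(order):
        variables = blocks[name][1]
        if i == 0:
            technique = "Liao's heuristic (Section 2)"
        elif variables <= fixed:
            technique = "fixed storage sequence (Section 3)"
        else:
            technique = "partially fixed storage sequence (Section 4)"
        plan.append((name, technique))
        # After a block is optimized, all of its variables are placed.
        fixed |= variables
    return plan
```

For three blocks with weights 100, 40 and 10 that access {a, b}, {a, c} and {b, c}, the sketch selects Section 2 for the first block, Section 4 for the second (c is not yet placed) and Section 3 for the third.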

6 Summary 

In this work, we have presented a compilation framework that employs both 
program restructuring and storage order optimizations to reduce the size of the 
generated code for embedded processors by eliminating as many explicit address 
register loads as possible. Reducing code size is extremely important as in many 
embedded systems a reduction in code size means a reduction in memory size. 
Work in progress includes the investigation of different ways of combining stor- 
age layout and code restructuring transformations, incorporating partitioning 
of variables among different address registers, and studying the impact of SSA 
transformation on code size. We also plan to experiment with different 
architectures, as different instruction set architectures (ISAs) can lead to different 
code sizes. 

Abstract. Offset assignment is a highly effective DSP address code optimization 
technique that has been implemented in a number of ANSI C compilers. In this 
paper we concentrate on a special class of offset assignment problems called “sim- 
ple offset assignment” (SOA). A number of SOA algorithms have been proposed 
recently, but experimental results and direct comparisons are still sparse. This 
makes the practical selection of a suitable SOA algorithm for implementation in a 
compiler very difficult. This paper aims at closing this gap by providing a compre- 
hensive benchmark suite and empirical evaluation based on real-life application 
programs. Our results for the first time permit a detailed assessment of all major 
SOA algorithms. In addition, we propose a new and superior combination of SOA 
heuristics. 



1 Introduction 

Due to the increased importance of software in embedded system design, code opti- 
mization techniques for embedded processors, particularly for digital signal processors 
(DSPs), have gained high interest in academia and industry. As compared to general- 
purpose processors, DSPs show a number of special hardware features, many of which 
impose new challenges on compiler construction: 

- Harvard architecture with separate program and data buses 

- Dual memory banks for high data access bandwidth 

- Hardware multiplier for fast product computation 

- DSP-specific instructions like multiply-accumulate, multimedia (SIMD) instructions, 
and saturating arithmetic 

- Limited amount of instruction-level parallelism 

- Inhomogeneous register set 

- Support for zero-overhead hardware loops 

- Real-time capabilities 

- Dedicated address generation units (AGUs) 

This paper considers code optimization techniques aiming at maximum utilization 
of AGUs. 

G. Hedin (Ed.): CC 2003, LNCS 2622, pp. 290-IMI 2003. 
1.1 Address Generation Units and Offset Assignment 

Offset assignment is a central code optimization technique in many C/C++ compilers 
for DSPs. It exploits the fact that many standard DSPs (e.g. TI C2x/C5x, Motorola 
56xxx, Analog Devices 210x, ST D950) as well as numerous application-specific DSPs 
comprise an AGU that is capable of performing address (or pointer) arithmetic in parallel 
to the main data path. 




Fig. 1. Address generation unit (AGU) architecture in DSPs with address register (AR) and modify 
register (MR) files. 



A typical DSP AGU (see fig. 1) comprises a file of address registers (ARs) that store 
pointers for indirect memory addressing modes. In order to optimize clock speed and 
to save silicon area, DSPs, in contrast to CISC and RISC machines, frequently do not 
support “base-plus-offset” addressing modes. Instead, in order to compute a new address 
a' = a ± c from a given address a stored in some AR, that AR has to be explicitly modified 
by adding or subtracting some constant c. The code efficiency of such AR modifications 
depends on the concrete value of c: if the absolute value of c is small enough that c 
fits into the auto-increment range R = [-r, r], then c can be encoded as an immediate 
operand into the same instruction that performs a memory access (LOAD or STORE) at 
address a. In that case, the AR modification can be performed within the AGU in parallel 
to the memory access by means of an auto-increment (or auto-decrement, depending on 
the sign of c) operation. Otherwise, if |c| > r, an extra instruction is required to compute 
the address a' = a ± c for the next memory access. 

Fig. 2. Illustration of offset assignment 

Hence, the auto-increment based address computation results in the highest code 
performance and density, and any C/C++ compiler for DSPs should aim at maximizing 
its use when generating code for address computations. One way to do this is offset 
assignment, where the memory layout for program variables is optimized such that the 
maximum number of address computations for scalar variables can be implemented by 
auto-increment. This is possible due to the fact that the stack layout for the local scalar 
variables of a C function can be freely chosen by the compiler. 

1.2 Offset Assignment Example 

Fig. 2 illustrates a sample stack frame layout in a DSP-specific compiler. The compiler 
typically allocates one of the ARs as a frame pointer (FP), which is used to address local 
variables on the stack. Suppose we have three such variables, A, B, and C, which are 
accessed in the sequence S = (A, C, A, B). Furthermore, suppose the auto-increment 
range R is restricted to [-1, 1]. This special case of using a single FP and R = [-1, 1] 
is called simple offset assignment (SOA). 

The upper part of fig. 2 illustrates the situation when the variables are assigned to 
stack locations (or offsets, relative to the stack frame boundary) n, n + 1, n + 2, in 
alphabetic order. Initially, FP points to variable A at address n. The next access goes to 
C located at n + 2. Due to the missing “base-plus-offset” addressing mode in DSPs, FP 
cannot remain constant throughout the entire function execution (as it normally does in 
CISC or RISC compiled code), but needs to be implemented as a floating or roving frame 
pointer. Thus, in order to access the variables according to sequence S, FP needs to be 
modified by the values +2, -2, +1, in that order. Because R = [-1, 1], only the last FP 
modification (+1) can be implemented by auto-increment, while two extra instructions 
are required to implement the modifications by +2 and -2. However, as shown in the 
lower part of fig. 2, the situation changes drastically when the variables are assigned 
to memory addresses in the order B-A-C, in which case the access sequence S 
implies FP modifications by +1, -1, -1, all of which fall into the auto-increment range 
R. Hence, the latter variable layout will result in better code, and it is the goal of SOA 
algorithms to compute such “good” variable layouts. 
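
The arithmetic of this example can be replayed with a few lines of Python. The sketch below is purely illustrative; the function names fp_modifications and extra_instructions are hypothetical, and offsets are counted from 0 rather than from the frame address n.

```python
def fp_modifications(layout, accesses):
    """Frame-pointer deltas needed to visit `accesses` under `layout`.

    layout   -- variables in increasing stack offset order
    accesses -- variable access sequence S
    """
    offset = {var: i for i, var in enumerate(layout)}
    return [offset[b] - offset[a] for a, b in zip(accesses, accesses[1:])]


def extra_instructions(layout, accesses, r=1):
    """Count FP modifications outside the auto-increment range [-r, r]."""
    return sum(1 for d in fp_modifications(layout, accesses) if abs(d) > r)


S = ["A", "C", "A", "B"]
print(fp_modifications(["A", "B", "C"], S))    # [2, -2, 1]
print(fp_modifications(["B", "A", "C"], S))    # [1, -1, -1]
print(extra_instructions(["A", "B", "C"], S))  # 2
print(extra_instructions(["B", "A", "C"], S))  # 0
```

With r = 1 this reproduces the SOA setting: the alphabetic layout needs two extra instructions, the layout B-A-C needs none.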

1.3 Motivation 

Experimental surveys indicate that it is not unusual for DSP machine code to comprise 
20%-30% (sometimes even more than 50%) of instructions used for address computations. 
In terms of total code size, the effect of performing offset assignment 
within a C compiler is typically on the order of 5%-20%, which is quite significant 
for DSPs with tight ROM size constraints. Due to their high importance for DSP code 
quality, offset assignment techniques have been implemented in several research (e.g. 
SPAM or RECORD) and industrial compilers (e.g. TI’s C2x/C5x C compiler 
or CHESS) for DSPs. 

Even though SOA is just a special case of offset assignment problems, it represents 
a real-world problem. This is because many DSPs have a relatively small 
instruction word length (mostly 16 bits), which allows only for a narrow auto-increment 
range like [-1, 1]. Moreover, generalized offset assignment approaches using multiple 
frame pointers mostly rely on SOA algorithms as subroutines. 

Consequently, a number of different SOA algorithms have been proposed in the 
literature. In spite of this, from a scientific viewpoint, the situation is not really satisfac- 
tory, since so far there has been no comprehensive benchmarking of the different SOA 
algorithms for real-life problems. Some algorithms have been compared to others, but 
frequently the comparisons are incomplete and are based on small program fragments or 
even random problem instances, so that reported results are hardly reproducible. So the 
question of which SOA algorithm is the “best” (w.r.t. their computation time vs solution 
quality tradeoff) is still largely open. 

Therefore, in this paper we do not just propose yet another SOA algorithm, but our 
main goal is to consolidate previous work by means of a comprehensive empirical study, 
in which we evaluate a set of different algorithms for a large suite of realistic SOA 
problem instances. This allows us to draw conclusions on which algorithms are most 
useful in practice and may be promising platforms for future offset assignment research. 
In more detail, the contributions of this paper are: 

1. We briefly review the major existing SOA algorithms and available experimental 
comparisons. 

2. We propose an extensible benchmark suite, called OffsetStone, for offset assignment 
algorithms together with the necessary tool support. 

3. We use OffsetStone to evaluate a total of 8 SOA algorithms and give detailed ex- 
perimental results about their performance in terms of computation time and (both 
absolute and relative) solution quality. 
4. We present a new combination of two fast SOA heuristics that turns out to be superior 
to all previous heuristics. 

The remainder of this paper is structured as follows. Section 2 discusses related 
work and gives a more precise description of the SOA problem. Section 3 outlines the 
OffsetStone benchmarking methodology and its tools. In section 4 we provide detailed 
experimental results. Finally, section 5 gives conclusions and mentions future work. 



2 Related Work 

2.1 Access Graph Model 

Bartley [5] proposed the access graph model for the simple offset assignment (SOA) 
problem, which forms the baseline for most SOA algorithms. Given a variable set 
$V = \{v_1, \ldots, v_n\}$ and a variable access sequence $S = (s_1, \ldots, s_m)$ of a basic block with 
$\forall i \in [1, m] : s_i \in V$, the access graph is an undirected, complete, and edge-weighted 
graph $G = (V, E, k)$ with $E = \{\{v, w\} \mid v, w \in V\}$. The function $k : E \to \mathbb{N}_0$ 
assigns a weight to each edge $e = \{v, w\}$ that denotes the number of access transitions 
between $v$ and $w$ in $S$, i.e., the number of subsequences of $S$ of the form $(v, w)$ or 
$(w, v)$. Due to the symmetry of auto-increment and auto-decrement, the ordering of 
$v$ and $w$ is irrelevant here. Likewise, self-edges of the form $\{v, v\}$ can be neglected. 
The left part of fig. 3 exemplifies the access graph model for $V = \{A, B, C, D\}$ and 
$S = (D, A, C, B, A, D, A, B, C)$. 





Fig. 3. Access graph model and maximum weighted Hamiltonian path 



Any access transition $(v, w)$ in $S$ can be implemented by auto-increment if and 
only if $v$ and $w$ are assigned neighboring stack locations, i.e. the offset difference of $v$ 
and $w$ is covered by the auto-increment range $[-1, 1]$. In order to maximize the use of 
auto-increment addressing, those variable pairs $\{v, w\}$ whose edge weight $k(\{v, w\})$ 
in $G$ is high should be neighbors in the stack frame, since this will save many 
extra instructions for address computation. 
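
As an illustration of the model, the edge weights for the access sequence S = (D, A, C, B, A, D, A, B, C) can be computed as sketched below. The function name access_graph and the representation of the graph as a Counter keyed by unordered variable pairs are assumptions made for this example; edges of weight zero are simply not stored.

```python
from collections import Counter


def access_graph(sequence):
    """Edge weights k({v, w}) of the access graph for one access sequence."""
    weights = Counter()
    for v, w in zip(sequence, sequence[1:]):
        if v != w:  # self-edges are neglected
            weights[frozenset((v, w))] += 1
    return weights


S = ["D", "A", "C", "B", "A", "D", "A", "B", "C"]
for edge, weight in sorted(access_graph(S).items(), key=lambda kv: sorted(kv[0])):
    print(sorted(edge), weight)
# ['A', 'B'] 2
# ['A', 'C'] 1
# ['A', 'D'] 3
# ['B', 'C'] 2
```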
2.2 Offset Assignment Heuristics 

As pointed out by Liao, the SOA problem eventually amounts to finding a maximum 
weighted Hamiltonian path P in G, i.e. a path touching each node once with the maximum 
edge weight sum (see right part of fig. 3). The memory layout is derived from P 
by assigning those node pairs to adjacent memory locations that are also neighboring 
in P (i.e. either C-B-A-D or D-A-B-C in the example from fig. 3). 

The cost of an SOA solution P is defined as the sum of the weights of G’s edges not 
covered by P. This corresponds to the number of extra address computation instructions 
to be inserted into the machine code. By means of a simple reduction from the classical 
Hamiltonian path problem, it can be shown that computing P is an NP-complete 
problem. Hence, heuristics should be used, except for small problem instances. 

Bartley [5] proposed a greedy heuristic for finding path P. His algorithm iteratively 
picks an edge e of highest weight k(e) in G and checks whether inclusion of e into a 
partial path P would still allow for a valid solution. This is iterated until a complete path 
with |V| - 1 edges has been selected. 

Liao proposed a more efficient implementation of Bartley’s SOA algorithm, 
by temporarily neglecting edges of zero weight (which are frequent in realistic access 
graphs) and using an efficient Union/Find data structure for checking for cycles. Apart from 
the implementation issues, Liao’s algorithm produces the same results as Bartley’s. 
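
The greedy edge selection with a Union/Find cycle check can be sketched as follows. This is a simplified reading of the description above, not the original code of either author: the function name greedy_soa is hypothetical, ties between same-weight edges are broken alphabetically (an arbitrary choice), and the path fragments left at the end are simply concatenated.

```python
from collections import Counter


def greedy_soa(sequence):
    """Greedy path cover on the access graph; returns a variable layout."""
    weights = Counter()
    for a, b in zip(sequence, sequence[1:]):
        if a != b:
            weights[tuple(sorted((a, b)))] += 1  # zero-weight edges never stored

    parent = {v: v for v in set(sequence)}

    def find(v):  # Union/Find with path halving
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    neighbours = {v: [] for v in parent}
    # Visit edges by decreasing weight; ties are broken alphabetically.
    for (a, b), _ in sorted(weights.items(), key=lambda kv: (-kv[1], kv[0])):
        # Accept an edge only if no node gets a third edge and no cycle forms.
        if len(neighbours[a]) < 2 and len(neighbours[b]) < 2 and find(a) != find(b):
            parent[find(a)] = find(b)
            neighbours[a].append(b)
            neighbours[b].append(a)

    # Walk every path fragment from one of its end points.
    layout, seen = [], set()
    for start in sorted(parent):
        if start in seen or len(neighbours[start]) == 2:
            continue
        prev, node = None, start
        while node is not None:
            layout.append(node)
            seen.add(node)
            nxt = [n for n in neighbours[node] if n != prev]
            prev, node = node, (nxt[0] if nxt else None)
    return layout


print(greedy_soa(["D", "A", "C", "B", "A", "D", "A", "B", "C"]))
# ['C', 'B', 'A', 'D']
```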

In his thesis, Liao additionally proposed a branch-and-bound (B&B) algorithm 
for SOA, which can be used to construct optimal solutions. The B&B algorithm is capable 
of effectively pruning the huge search space, but it can generally only be applied to small 
problem instances due to its sometimes excessive runtime requirements. 

Neither Bartley’s nor Liao’s heuristic includes special handling of edges with 
equal weight during path construction. However, same-weight edges are very common 
in access graphs, and the solution quality may critically depend on the order in which 
edges are investigated during path construction. Therefore, Leupers and Marwedel 
proposed to extend Liao’s algorithm by a tie-break heuristic for choosing among 
same-weight edges. An experimental evaluation for a set of random SOA problem instances 
indicated that the tie-break heuristic on average gives a slight improvement over Liao’s 
heuristic. This has been confirmed by independent experiments, while other 
experiments on some of the DSPStone benchmark programs reported in [2] did not 
indicate such an improvement. 

A genetic algorithm (GA) based approach to SOA has also been presented. In 
contrast to most other methods, it does not use the access graph model, but constructs 
offset assignments directly by a (relatively time-consuming) simulation of a natural 
evolution process. Actually, the GA has been mainly intended for a more general class 
of offset assignment problems, but it can easily be restricted to solve the SOA problem. 
A direct comparison to fast heuristics for the special case of SOA has not been reported, 
though. 

Atri et al. proposed an incremental SOA algorithm. It starts with an initial SOA 
solution, constructed by some heuristic, and performs an iterative improvement by a local 
exchange of access graph edges selected for the maximum weighted Hamiltonian path. 
An experimental comparison to Liao’s heuristic for a set of random SOA instances 
indicated that the initial solution can be improved in 3-8% of the cases considered, 
where the average improvement is about 5%. Unfortunately, no comparison to other 
SOA algorithms was reported. 

Besides these approaches, many generalizations of SOA have been considered, including 
the general offset assignment (GOA) problem that handles 
multiple frame pointers, DSPs with auto-increment operations between the memory 
accesses, auto-increment ranges beyond ±1, AGUs with modulo 
addressing modes, exploitation of scheduling freedom in the variable access sequence, 
as well as procedure-level offset assignment. Other researchers have 
dealt with DSP-specific compiler techniques for address register assignment in case of 
arrays and predefined memory layouts, which are 
not directly related to SOA. 

3 Evaluation Methodology 

Summarizing the discussion of SOA algorithms in section 2, many techniques have not 
been directly compared to each other so far, while the few comparisons that do exist 
are mostly based on small data bases or random problem instances. However, random 
instances generally do not well reflect real-world problems, since the latter tend to show 
higher locality in the variable access sequences. 

3.1 OffsetStone Benchmarks 

For the sake of a more reliable and reproducible evaluation of available SOA algorithms, we 
have composed OffsetStone, a large suite of SOA problem instances extracted from 31 
complex real-world application programs written in ANSI C. These include 
computation-intensive DSP applications (e.g. MPEG2, MP3, ADPCM, DSPStone, FFT, JPEG, GSM, 
Viterbi) but also more control-dominated standard applications (e.g. GZIP, FLEX, BISON, 
CPP). Altogether, the C applications chosen for OffsetStone comprise more than 
300,000 lines of C source code. They are certainly representative and much broader than 
what has been used for SOA benchmarking in previous work. 

From a benchmarking viewpoint, an interesting observation is that there are no 
significant differences in the behavior of the SOA algorithms for different benchmark 
types (i.e. DSP or general-purpose). Therefore, there was no need to restrict the evaluation 
to DSP applications only. 

For each application program, we extracted SOA problem instances by means of the 
following steps: 

1. The ANSI C sources for the application are translated into a three-address code 
intermediate representation (IR) by means of the LANCE C frontend [24] in order to 
make the variable access sequences explicit. Additionally, this step inserts the temporary 
variables for intermediate results that a compiler would normally generate. 

2. The IR is optimized by standard techniques used in most compilers, including com- 
mon subexpression elimination, dead code elimination, constant folding, jump op- 
timization, etc. This step ensures that the IR does not contain superfluous variables 
and computations, which a compiler would eliminate anyway. 
3. From the optimized IR, the detailed variable access sequence is extracted from each 
basic block. 

4. Since any offset assignment is valid throughout an entire C function, one global 
access graph is constructed per function by merging the local access graphs of the 
basic blocks. In this way, all local access sequences are represented in a single graph. 
Each global access graph forms one instance of the SOA problem. 
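
Step 4 can be sketched as follows. The sketch assumes, as one plausible reading of "merging", that the weights of identical edges in the local graphs are added; the function names are hypothetical and are not taken from the OffsetStone tools.

```python
from collections import Counter


def local_access_graph(sequence):
    """Edge weights for the access sequence of a single basic block."""
    weights = Counter()
    for v, w in zip(sequence, sequence[1:]):
        if v != w:
            weights[frozenset((v, w))] += 1
    return weights


def global_access_graph(basic_blocks):
    """Merge the local graphs of all basic blocks of one function.

    Weights of identical edges are summed (an assumption). No transition
    is counted across the boundary between two basic blocks.
    """
    total = Counter()
    for sequence in basic_blocks:
        total.update(local_access_graph(sequence))
    return total


blocks = [["a", "b", "a"], ["b", "a", "c"]]
merged = global_access_graph(blocks)
print(merged[frozenset(("a", "b"))], merged[frozenset(("a", "c"))])  # 3 1
```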

With this methodology we obtained a total of more than 3000 realistic SOA problem 
instances. The extraction is restricted to variables fitting into a single memory word, i.e., 
variables that directly qualify for offset assignment. We also excluded pointer variables, 
since these are mostly allocated in address registers and not on the stack frame. 

Our approach assumes that all variables extracted will actually be assigned to the 
stack. This is not necessarily true, since a compiler generally will be able to keep some 
of the variables in the data path registers. However, as DSPs with AGUs typically show 
very few data path registers, it is reasonable to assume that the extracted sequences are 
very close to the actual access sequences in compiled code. 



3.2 SOA Algorithms Included in OffsetStone 

For the extracted benchmarks, we evaluated the following 8 SOA algorithms: 

1. SOA-OFU: A trivial offset assignment algorithm, where variables are assigned to 
offsets in the order of their first use in the code. This order would typically be used 
in non-optimizing compilers without a dedicated SOA phase, and thus serves as a 
baseline case for our experiments. 

2. SOA-Bartley: Bartley’s SOA heuristic [5] based on the access graph model. 

3. SOA-Liao: Liao’s SOA heuristic based on the access graph model. 

4. SOA-BB: Liao’s branch-and-bound algorithm for optimally solving SOA. 

5. SOA-TB: SOA-Liao extended by the tie-break heuristic of Leupers and Marwedel. 

6. SOA-GA: The genetic algorithm for SOA. 

7. SOA-INC: The incremental SOA algorithm of Atri et al., using SOA-Liao for 
constructing initial solutions. 

8. SOA-INC-TB: A new combination of SOA algorithms, using SOA-INC in combination 
with SOA-TB for constructing initial solutions. As will be shown later, using 
SOA-TB instead of SOA-Liao in total results in a higher optimization potential for 
SOA-INC. 
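
The baseline SOA-OFU is simple enough to state in a few lines. The sketch below is an illustration only, with hypothetical function names, and it reuses the cost metric of section 2.2 (the number of access transitions between variables that are not neighbors in the layout).

```python
def soa_ofu(sequence):
    """Order-of-first-use layout: place variables as they first appear."""
    return list(dict.fromkeys(sequence))  # keeps first occurrences, in order


def soa_cost(layout, sequence):
    """Number of access transitions not covered by auto-increment."""
    pos = {var: i for i, var in enumerate(layout)}
    return sum(1 for a, b in zip(sequence, sequence[1:])
               if abs(pos[a] - pos[b]) > 1)


S = ["D", "A", "C", "B", "A", "D", "A", "B", "C"]
layout = soa_ofu(S)
print(layout, soa_cost(layout, S))  # ['D', 'A', 'C', 'B'] 2
```

For this small sequence, the first-use layout costs 2 extra instructions, compared with 1 for the layout C-B-A-D.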

^ In this formulation, SOA minimizes code size. For performance optimization, profiling in- 
formation can be exploited by assigning higher edge weights to frequently executed program 
paths. 

^ The OffsetStone benchmark access sequences are available from the author upon request, 
including the corresponding tools for access sequence extraction and the C++ source code for 
our implementation of the 8 SOA algorithms. This allows other researchers to easily reproduce 
the results, to add more offset assignment algorithms to the existing infrastructure, and to extend 
the benchmark suite by extracting access sequences from further application programs. 
4 Experimental Results 

The 8 SOA algorithms have been implemented in C++ in the form of different routines 
within a single driver program. Naturally, high attention has been paid to uniform soft- 
ware engineering practices, in order to ensure a fair comparison. The algorithms have 
been applied to all OffsetStone benchmarks, where the costs (according to the metric 
defined in section 2.2) and the CPU times (on a 1.3 GHz Linux PC) have been measured. 
An exception, however, is the SOA-BB algorithm. Due to its sometimes excessive runtime 
requirements, we restricted its use to problem instances with at most 12 variables 
(this already corresponds to 12! ≈ 479 · 10^6 possible solutions). 

4.1 Performance Relative to SOA-OFU 

We first focus on a comparison to the “naive” algorithm SOA-OFU. Table 1 gives the 
average percentage of the solution cost of 6 SOA algorithms (SOA-OFU set to 100%, 
SOA-BB not included here due to runtime limitations). SOA-Bartley and SOA-Liao are 
combined into a single column since they always produce identical results. 

The line labeled “average” in table 1 shows the average cost values over all OffsetStone 
benchmarks. As can be seen, all SOA algorithms reduce the cost as compared to 
SOA-OFU by about 25% on average, with relatively small differences among them (the 
reason for this will become clear later). The best results are produced by SOA-GA, 
followed by SOA-INC-TB and SOA-TB. 

For the sake of completeness, we also applied the algorithms to random access sequences, 
as has frequently been done in previous work. The line labeled “random” in table 1 
shows the average results obtained after applying the SOA algorithms to a set of 3000 
random SOA problem instances with varying numbers of variables and access sequence 
lengths as they typically occur in practice. Even though the order of result quality does 
not change, the performance difference between the algorithms is smaller, and the 
improvement over the naive algorithm SOA-OFU is much lower (< 8%) than for 
real SOA problems. This can be explained by the fact that the edge weights in the access 
graph are more uniformly distributed for random sequences than for real sequences. This 
means that there are no big “peaks” in the objective function so that even optimal SOA 
solutions are not much better than naive (SOA-OFU) solutions. Hence, the optimization 
potential for SOA algorithms is significantly lower. This confirms our above statement 
that random problem instances are not the best choice for evaluating SOA algorithms. 

4.2 Runtimes 

Table 2 shows results on the average runtime requirements (CPU milliseconds) per SOA 
problem instance. SOA-OFU is not included, since it requires essentially no processing 
time at all. Note that SOA-Bartley in its original form can only be used for small problem 
sizes, due to its very high runtime requirements. However, we found that it can be easily 
accelerated by temporarily suppressing the zero-weight edges in the access graph. The 
first line in table 2, therefore, refers to this improved implementation of SOA-Bartley. 
Nevertheless, its “twin” algorithm SOA-Liao is still faster on average. 
Table 1. Relative cost of SOA algorithms compared to SOA-OFU solutions (100%)

benchmark   Liao    TB      INC     INC-TB  GA
8051sim     83.1    79.8    80.7    79.0    79.0
adpcm       81.1    79.3    80.1    78.6    78.5
anagram     68.9    66.9    68.2    66.2    65.6
anthr       81.1    79.9    80.9    79.9    79.9
bdd         78.6    76.9    78.4    76.9    76.9
bison       78.2    77.1    78.1    77.0    77.0
cavity      85.1    82.4    84.6    82.2    82.2
cc65        78.4    76.3    77.2    76.3    76.2
codecs      81.5    80.3    81.4    80.3    80.3
cpp         77.4    76.3    77.3    76.3    76.3
dct         77.6    77.8    77.6    77.4    77.4
dspstone    76.4    74.4    76.0    74.3    74.3
eqntott     65.0    65.0    65.0    65.0    65.0
f2c         73.7    72.7    73.6    72.6    72.6
fft         92.0    92.0    92.0    92.0    92.0
flex        71.3    69.3    71.0    69.3    69.3
fuzzy       77.5    74.2    77.0    74.2    74.2
gif2asc     83.1    82.0    83.0    81.7    81.7
gsm         81.5    80.9    81.3    80.9    80.8
gzip        77.1    73.2    76.3    73.2    73.2
h263        70.3    70.0    70.0    70.0    69.6
hmm         70.5    67.4    69.8    67.3    67.3
jpeg        73.7    71.8    73.4    71.7    71.6
kit         68.2    66.1    67.6    66.1    66.0
lpsolve     78.1    77.1    77.8    77.1    77.1
motion      90.6    91.1    90.6    89.6    89.6
mp3         72.3    71.6    72.2    71.6    71.4
mpeg2       77.0    76.0    76.8    75.9    75.8
sparse      75.9    75.1    75.9    75.1    75.1
triangle    65.8    64.4    65.6    64.4    64.3
viterbi     89.3    85.0    89.1    84.9    84.9
average     76.71   75.23   76.40   75.16   75.10
random      92.74   92.24   92.62   92.17   92.13



Table 2. Average runtime per problem instance

Algorithm     CPU time (msecs)
SOA-Bartley       0.97
SOA-Liao          0.67
SOA-TB            0.68
SOA-INC           4.60
SOA-INC-TB       23.00
SOA-GA         8296.26
Table 3. Average overhead compared to optimum

Algorithm     % overhead
SOA-BB            0.00
SOA-OFU          67.09
SOA-Liao          4.34
SOA-TB            0.16
SOA-INC           2.28
SOA-INC-TB        0.11
SOA-GA            0.00



The average runtimes are mostly in the order of milliseconds or even less, with
SOA-Liao and SOA-TB being the fastest algorithms. There is a big gap to SOA-GA though,
which on average needs about 8.3 CPU seconds per problem instance. This leads to a
clear separation of SOA algorithms into fast and slow ones, where the latter category
comprises SOA-GA and SOA-BB.



4.3 Performance Relative to Optimum 

For about 41% of all benchmark problems (i.e. the “small” problems with at most 12
variables), we computed optimal solutions by means of the SOA-BB algorithm. This
allowed us to measure the absolute quality of computed SOA solutions. The results
are given in Table 3, which shows the average percentage of cost overhead compared
to the optimal solutions for each algorithm. Naturally, the trivial algorithm SOA-OFU
shows the highest overhead. As can be seen, all heuristics get more or less close to the
optimum, which explains the small differences found in Table 1. SOA-Liao yields an
average overhead of 4.34%, while SOA-INC-TB is the best of the fast heuristics, with
an overhead of only 0.11%. SOA-GA found the optimum in all cases. For the test cases
covered by Table 3, SOA-BB needed about 3.5 CPU seconds per SOA instance, while
SOA-GA took 0.8 CPU seconds. The CPU times of the fast heuristics are negligible in
practice.
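The overhead measure can be sketched as follows. This is not SOA-BB: the sketch substitutes plain exhaustive enumeration for branch-and-bound, which is only feasible for a handful of variables, and it assumes the simplified cost model in which a step between neighbouring offsets is free. The access sequence is a hypothetical example.

```python
from itertools import permutations


def soa_cost(layout, accesses):
    """Number of consecutive accesses whose offsets differ by more than 1."""
    offset = {var: pos for pos, var in enumerate(layout)}
    return sum(
        1
        for prev, cur in zip(accesses, accesses[1:])
        if abs(offset[prev] - offset[cur]) > 1
    )


def optimal_cost(accesses):
    """Exhaustively try every layout (stand-in for a branch-and-bound search)."""
    variables = sorted(set(accesses))
    return min(soa_cost(layout, accesses) for layout in permutations(variables))


def overhead_percent(cost, optimum):
    """Extra cost over the optimum, in percent (assumes optimum > 0)."""
    return 100.0 * (cost - optimum) / optimum


accesses = list("abcadafad")                               # hypothetical sequence
optimum = optimal_cost(accesses)
ofu_cost = soa_cost(list(dict.fromkeys(accesses)), accesses)  # order of first use
```

For this toy instance the order-of-first-use layout costs three times the optimum, so `overhead_percent(ofu_cost, optimum)` is far above the averages in Table 3; the example only shows how the metric is formed.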

5 Conclusions 

Given that the OffsetStone benchmarks provide a good representation of real-world SOA
problems, the experimental data from Section 4 permit us to draw the following conclusions
that were not available from previous work:

- Generally, the performance difference between SOA algorithms for real problems 
is surprisingly small. Hence, it might appear that the concrete algorithm used in a C 
compiler for DSPs does not matter much. However, under the tight cost constraints 
of embedded systems where sometimes every program ROM word matters, the best 
algorithm with an acceptable runtime should certainly be chosen. 

- SOA-Bartley can be easily implemented much more efficiently than in the originally 
proposed form, but SOA-Liao is still faster while giving the same results. 
- SOA-TB achieves better average results for real-life problems than
SOA-Liao/SOA-Bartley at virtually no increase in computation time, and it also achieves
better solutions than SOA-INC.

- The new combination of SOA algorithms (SOA-INC-TB) proposed in this paper 
achieves the best results of all fast heuristics tested here. Hence, it can be recom- 
mended for fast compilers and can replace the use of SOA-Bartley/Liao, SOA-TB, 
and SOA-INC. At least for “small” problems it achieves an extremely low average 
overhead compared to optimal solutions. 

- In case priority is given to highest code quality and not to high compilation speed
(say, in a final compiler run with highest optimization effort to generate production
code with minimal ROM size), the SOA-GA algorithm should be preferred. For
“small” SOA problems, SOA-BB can be used to compute optimal solutions, but we
observed that SOA-GA finds the optimum in virtually all cases (even though it is not
guaranteed to do so) at less than 25% of the computation time requirements of
SOA-BB. SOA-BB is frequently fast but sometimes shows extreme peaks in computation
time due to its branch-and-bound nature, whereas the runtimes of SOA-GA are
predictable.

- The use of random access sequences for evaluation of SOA algorithms, though quite 
common in previous research, does not accurately reflect the algorithm behavior for 
real applications. Our experimental results indicate that random sequences do allow 
for a coarse performance comparison between algorithms, but they definitely do not 
exhibit their optimization potential for real-life application code. 

OffsetStone is the first effort towards fair benchmarking of offset assignment
algorithms based on a huge suite of realistic problem instances. It allowed us to provide an
in-depth evaluation of most state-of-the-art SOA algorithms. The results provide
valuable hints both for compiler developers and researchers working on offset assignment
in C compilers for DSPs. As a secondary contribution, we were able to identify a new
combination of fast heuristics (SOA-INC-TB) that is superior to previous algorithms.

As a first step, in this paper we have focused only on SOA, the most basic class of
offset assignment problems. In the future, the suite of algorithms included in OffsetStone
will be extended to also cover generalized offset assignment problem formulations,
e.g. offset assignment with variable live range information, exploitation of scheduling
mobility of instructions, or general offset assignment with multiple address registers,
some of which have been mentioned in Section 2.
Abstract. The High-Level Optimizer (HLO) is a key part of the compiler technology
that enabled the Itanium™ and Itanium™ 2 processors to deliver leading
floating-point performance at their introduction. In this paper, we discuss the design
and implementation experience in integrating diverse optimizations in the HLO
module. In particular, we describe decisions made in the design of HLO targeting
the Itanium processor family. We provide empirical data to validate the design
decisions. Since HLO was implemented in a production compiler, we made
certain engineering trade-offs. We discuss these trade-offs and outline the key
lessons learned from our experience.



1 Introduction 

The Explicitly Parallel Instruction Computing (EPIC) technology behind the
Itanium™ processor architecture provides a rich set of features [3,4], which allow the
compiler to exploit instruction-level parallelism (ILP) and optimize applications in
many new ways. Intel’s compiler for the Itanium processor family incorporates and
extends the latest optimization techniques, and new techniques have been designed
specifically for the Itanium architecture [3,4]. As a result, the Intel compiler helped
deliver the world's best floating-point performance during the introduction of the Itanium
and Itanium 2 processors. The High-Level Optimizer (HLO) has been a key component
that helped achieve this performance. Broadly, HLO encompasses optimizations
that operate on high-level program structures such as loops and arrays. In this paper,
we discuss the design and implementation of HLO targeting the Itanium processor family
and describe the key lessons learned from this experience.

Processor speed has been increasing much faster than memory speed over the past
several generations of processor families. The HLO component in the Intel compiler for
the Itanium processor applies loop-based and region-based control and data
transformations in order to: i) improve data access behavior with memory optimizations, ii)
maximize resource usage in innermost loops, and iii) expose higher instruction-level
parallelism.

G. Hedin (Ed.): CC 2003, LNCS 2622, pp. 303-319, 2003. 

© Springer-Verlag Berlin Heidelberg 2003
In HLO, we have implemented numerous well-known and new transformations,
and more importantly, we combined and extended these transformations in special
ways so as to exploit the Itanium™ processor architecture features for higher
application performance. Fig. 1 shows the contribution of optimizations in HLO to the
performance of the SPECfp2000 benchmark suite, consisting of 14 F77/F90/C programs.
All experiments in this paper were conducted using version 6.0 Beta of the Intel
Compiler for Microsoft Windows 2000/XP on an 800 MHz Itanium processor based system
with 4 MB L3 cache. The graph shows the performance improvement or serial speedup
over baseline. The baseline contains all optimizations excluding HLO used for SPEC
base reporting. In particular, baseline includes inter-procedural optimizations and
profile feedback.



Fig. 1. Serial speedup due to optimizations in HLO (“geo” is geomean)

Fig. 2. Impact of individual optimizations in HLO on SPECfp2000 performance

Substantial performance gains from HLO are the result of selection of a large rep- 
ertoire of transformations, design decisions that took into account the details of the 
Itanium architecture and careful and iterative phase-ordering decisions. This paper de- 
scribes the experience in designing and implementing the HLO. The objective of this 
paper is to: 

• Describe design decisions made during building of HLO targeting Itanium proc- 
essor family. 

• Present experimental results that validate the design decisions. 
• Discuss the key learning and engineering trade-offs in a production compiler. 

The rest of the paper is organized as follows. Section 2 provides design
considerations for optimizations in the HLO targeting the Itanium processor family. Section 3
validates the design decisions with empirical data that show the impact of individual
HLO transformations and how they interact with the rest of the transformations. Key
lessons appear in Section 4 and concluding remarks in Section 5.



2 Design Considerations Targeting the Itanium™ Processor 

In this section, we outline the design considerations for various optimizations in HLO 
while targeting the Itanium processor. The optimizations in HLO have been designed 
and implemented with a conscious effort to exploit the features in the Itanium proces- 
sor architecture. In HLO, we have implemented many transformations that fall under 
these broad categories: 

• Locality optimizations [1, 9]: linear loop transformations, loop fusion, loop
distribution and strip-mining.

• ILP optimizations: unrolling, register blocking, affine-condition unswitching, and 
load-pair insertion. 

• Maximize resource usage: Scalar replacement of memory references, affine- 
condition unswitching, and load-pair insertion. 

• Data prefetching. 

• State-of-the-art dependence and section analysis to support optimizations [1]. 

Fig. 2 shows the impact of individual optimizations in HLO on the performance of
the SPECfp2000 benchmark suite. In the graph, the x-axis shows the optimizations or
groups of optimizations. Here stands for data-prefetching, lit for linear loop
transformations, dist for loop distribution and strip-mining, for fusion, sc for scalar
replacement, and finally ILP stands for unroll-and-jam, affine-condition unswitching,
and load-pair insertion. The y-axis is the percentage improvement because of the HLO
optimizations over all other optimizations in our compiler.

This performance improvement is the result of 1) Itanium-architecture conscious
design of optimizations and 2) careful orchestration of interaction between
optimizations. Consequently, we are able to derive an efficient phase-ordering in our HLO,
which is shown in Fig. 3. (The figure does not include demand-driven calls to
optimizations as a way of implementing certain other optimizations.) There exists no
established technique to derive an optimal phase-ordering, which is a computationally hard
problem. In the subsections that follow, we describe design criteria for several
optimizations targeting the Itanium™ processor family. We also describe the rationale in
positioning each optimization relative to other optimizations in the phase-order. These
rationales together form the basis for a partial order of optimizations. The final phase
order was derived from this partial order by considering other constraints such as
minimizing compile time, say by minimizing updates to the dependence graph, and
ease of maintenance. In Section 3, we evaluate the current phase-order by providing
empirical data on the interaction among HLO optimizations - a positive interaction
validates the choices and a negative interaction shows opportunity for improvement.
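The idea of a partial order that constrains the final phase order can be sketched with a topological sort. The constraint pairs below are a paraphrase of the phase-ordering rationales given in Sections 2.1 to 2.4, and the phase names (including "second loop fusion") are labels chosen for this sketch; the actual compiler does not necessarily derive its order this way.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# (earlier, later) pairs paraphrased from the phase-ordering rationales.
CONSTRAINTS = [
    ("loop distribution", "linear loop transformations"),
    ("linear loop transformations", "loop fusion"),
    ("loop fusion", "unroll-and-jam"),
    ("load-pair detection", "unroll-and-jam"),
    ("linear loop transformations", "affine-condition unswitching"),
    ("unroll-and-jam", "second loop fusion"),
    ("unroll-and-jam", "scalar replacement"),
    ("unroll-and-jam", "data prefetching"),
    ("scalar replacement", "data prefetching"),
    ("unroll-and-jam", "load-pair insertion"),
    ("scalar replacement", "load-pair insertion"),
]

# Phase order as listed in Fig. 3 (without loop recognition and graph construction).
FIG3_ORDER = [
    "loop distribution",
    "linear loop transformations",
    "loop fusion",
    "load-pair detection",
    "affine-condition unswitching",
    "unroll-and-jam",
    "second loop fusion",
    "scalar replacement",
    "data prefetching",
    "load-pair insertion",
]


def respects(order, constraints):
    """True if every 'earlier' phase precedes its 'later' phase in `order`."""
    position = {phase: index for index, phase in enumerate(order)}
    return all(position[a] < position[b] for a, b in constraints)


def one_valid_order(constraints):
    """Return some linear order compatible with the partial order."""
    sorter = TopologicalSorter()
    for earlier, later in constraints:
        sorter.add(later, earlier)  # `later` depends on `earlier`
    return list(sorter.static_order())
```

A topological sort yields only one of many compatible orders; choosing among them is where secondary criteria such as compile time and maintainability come in.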



1. Loop Recognition and Normalization
2. Construct Dependence Graph
3. Loop Distribution
4. Linear Loop Transformations
5. Loop Fusion
6. Loadpair Detection
7. Affine Condition Unswitching
8. Block Unroll and Jam
9. Loop Fusion
10. Scalar Replacement
11. Data Prefetching
12. Loadpair Insertion

Fig. 3. Current phase-ordering of optimizations in HLO



2.1 Locality Optimizations 

Caches are an important hardware means to bridge the gap between processor and 
memory access speeds. However, programs, as originally written, may not effectively 
utilize the available cache. Our design consideration was to: 

• implement all loop transformations that are well known in the community and
industry to improve the locality of data references [1, 2, 7, 9].

• account for the fact that L2 is the first level at which floating-point data may reside.

• account for the benefit of data prefetching in the reuse models that trigger many loop
transformations.

We have implemented linear loop transformations, loop fusion, and loop distribution
in the Intel compiler to improve data locality. As a combined effect, linear loop
transformations [6,9] can dramatically improve memory access locality. They can also
improve the effectiveness of other optimizations, such as scalar replacement, invariant
code motion, and software pipelining. For example, a loop interchange can make
references to arrays inner-loop invariant, besides improving the access behavior of
other references. Important design considerations are:

• Make sure phase order is such that linear loop transformations occur sufficiently 
early to enable other transformations. 

• Software pipelining [5] is important on the Itanium processor family, so use linear 
loop transformations to improve effectiveness by enabling parallelism in inner 
loops, and by interchanging, whenever possible, so that innermost loops have suf- 
ficiently large counts. 

• Linear loop transformations have to be aware of the benefits of exposing refer- 
ences with spatial locality for effective data prefetching. 
Loop fusion [1,8] is effective in improving cache performance, since it combines
the cache context of multiple loops into a single new loop. Thus, data reuse across
nested loops is within the same new nested loop. Loop fusion increases opportunities
for reducing the overhead of array references by replacing them with references to
compiler-generated scalar variables. Loop fusion also improves the effectiveness of
data prefetching. While implementing loop fusion for the Itanium processor family,
one must consider:

• 128 floating-point and 128 general registers available in the Itanium processor 
family. This allows aggressive loop fusions without the risk of register pressure. 
The design allows for loop fusion across call boundaries, and code motion to en- 
able loop fusion. 

• Trade-offs between locality and instruction level parallelism. For example, a 
fused loop may not be software pipelined because of dependences. In this case, 
the benefits of locality must outweigh the benefits of exploiting ILP across the 
back-edge of the fused loop. 
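The effect of loop fusion on data reuse can be sketched as follows. The function and array names are hypothetical, and Python lists stand in for arrays; the point is only that the fused loop can forward the freshly computed value through a scalar instead of re-reading it from memory.

```python
def unfused(b, c):
    """Two separate loops: the second re-reads every a[i] from memory."""
    n = len(b)
    a = [0.0] * n
    d = [0.0] * n
    for i in range(n):
        a[i] = b[i] + c[i]
    for i in range(n):
        d[i] = a[i] * 2.0
    return a, d


def fused(b, c):
    """One loop: the value of a[i] is reused through the scalar t."""
    n = len(b)
    a = [0.0] * n
    d = [0.0] * n
    for i in range(n):
        t = b[i] + c[i]  # compiler-generated scalar in the real transformation
        a[i] = t
        d[i] = t * 2.0
    return a, d
```

Fusion is legal here because iteration i of the second loop depends only on iteration i of the first; whether it is profitable depends on the trade-offs listed above.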

Besides enabling other transformations, loop distribution [1,9] spreads the poten- 
tially large cache context of the original loop into different new loops, so that the new 
loops have manageable cache contexts and higher cache hit rates. While designing 
loop distribution for the Itanium processor family, the design considerations included 
the following: 

• As in other compilers, use loop distribution to create perfect nests, enable loop 
interchange and loop blocking. 

• Use loop distribution to partition a loop into loops with calls or non-inlined
intrinsics and loops without calls. This is because loops with calls cannot be software
pipelined in the current compiler.

• Large loops may be distributed to avoid running out of rotating registers in soft- 
ware pipelining. This requires tradeoff between loss of locality and benefit of 
software pipelining. 

• Ability to expose ILP across loop back-edges has sufficiently higher benefit to tilt 
the balance towards expansion of scalar variables to enable loop distribution. 

Phase-ordering constraints: Loop distribution, linear loop transformations, and loop 
fusion are run in that order. Loop distribution exposes perfect nests and thus opportu- 
nities for linear loop transformations. Together, they expose opportunities for loop fu- 
sion. All three rely on the same cost model for better synergy. 



2.2 ILP Optimizations 

In this section we describe unrolling while the other ILP optimizations are covered in
Section 2.3. The design of the Intel compiler for the Itanium processor unifies loop
blocking, unroll-and-jam [1,9], and inner loop unrolling. Loop unrolling exposes
parallelism across instructions in adjacent loop iterations. The large number of registers
in the Itanium processor architecture enables the compiler to unroll loops by
significantly larger factors without register spills than compilers for other contemporary
architectures. This feature can be used to expose outer loops as new inner loops for
software pipelining. While designing block-unroll-jam for the Itanium processor
family, we had the following considerations:

• Unroll aggressively so as to extract greater ILP using large register file, and ex- 
pose outer and larger loops to software pipelining and effective data prefetching. 

• While blocking, consider the interaction of prefetching, need for a larger loop it- 
eration count for efficient software pipelining, available bandwidth, and primarily 
the ability to issue 2 Fused-Multiply-Add (FMA) instructions in a cycle. 

• Unroll and unroll-and-jam to maximally use machine resources and avoid a
fractional II loop body that under-utilizes machine resources.

• Use unrolling to expose more opportunities for loop fusion, insertion of load-pair 
instructions, and maximal resource usage. 

• Pay attention to compile time since loop unrolling increases code size linearly,
and the increase could adversely affect later optimizations that are quadratic or
cubic in compile-time complexity.

Phase-ordering constraints: From the discussion above, it is clear that
unroll-and-jam should be performed after loop distribution, interchange, and fusion, but before
data prefetching and scalar replacement. We will find later that load-pair insertion has
to be done after unrolling as well. However, note that there is an advantage to
performing loop fusion again after loop unrolling as it exposes more conforming loop
nests to loop fusion. Thus there is a second call to fusion after unroll-and-jam as
shown in Fig. 3.
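A minimal sketch of unroll-and-jam, using a matrix-vector product as a hypothetical example: the outer loop is unrolled by a factor of 2 and the two copies of the inner loop are jammed together, so each x[j] is loaded once and feeds two independent multiply-add chains. The unroll factor and the remainder handling are choices made for this sketch.

```python
def matvec(A, x):
    """Reference: y[i] = sum_j A[i][j] * x[j]."""
    y = [0.0] * len(A)
    for i in range(len(A)):
        for j in range(len(x)):
            y[i] += A[i][j] * x[j]
    return y


def matvec_unroll_jam(A, x):
    """Outer loop unrolled by 2, inner loop bodies jammed into one loop."""
    n, m = len(A), len(x)
    y = [0.0] * n
    i = 0
    while i + 1 < n:
        acc0 = acc1 = 0.0
        for j in range(m):
            xj = x[j]                 # one load feeds both multiply-adds
            acc0 += A[i][j] * xj
            acc1 += A[i + 1][j] * xj
        y[i], y[i + 1] = acc0, acc1
        i += 2
    if i < n:                         # remainder row when n is odd
        acc = 0.0
        for j in range(m):
            acc += A[i][j] * x[j]
        y[i] = acc
    return y
```

The two accumulators are independent, which is the kind of parallelism that lets a machine issue two fused multiply-add instructions per cycle.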



2.3 Maximizing Resource Usage 

This category of optimizations includes scalar replacement of memory references, 
load-pair insertion, and affine-condition unswitching. Scalar replacement [2] is a tech- 
nique to replace memory references with compiler-generated temporary scalar vari- 
ables that are eventually mapped to registers. Most back-end optimization techniques 
map array references to registers when there is no loop-carried data dependence. 
However, the back-end optimizations do not have accurate dependence information to 
replace memory references with loop-carried dependence by scalar variables. Scalar 
replacement, as implemented in the Intel compiler for the Itanium™ processor, also 
replaces loop invariant memory references with scalar variables defined at appropriate 
levels of the loop nesting. 

The design considerations for scalar replacement on an Itanium processor based 
platform are: 

• Map the compiler-inserted scalars directly onto rotating registers supported by the 
Itanium architecture [3]. The scalar moves required for scalar replacement [2,7] 
to preserve values across iterations are marked as MCOPY statements. The code 
generator maps the scalars to appropriate rotating registers so that explicit move 
instructions are unnecessary. 
• Ensure that exact dependence information is available for the common cases. This
also implies that all earlier transformations such as unrolling will have to maintain
accurate dependence information.
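Scalar replacement across loop iterations can be sketched with a three-point sum, a hypothetical example. The temporaries t0, t1, t2 play the role of the compiler-generated scalars, and the two moves at the end of each iteration correspond to the copies that the text says are marked as MCOPY and absorbed by rotating registers. The load counters exist only to make the saving visible.

```python
def stencil_reference(a):
    """out[i-2] = a[i-2] + a[i-1] + a[i], with three loads per iteration."""
    out, loads = [], 0
    for i in range(2, len(a)):
        out.append(a[i - 2] + a[i - 1] + a[i])
        loads += 3
    return out, loads


def stencil_scalar_replaced(a):
    """Same computation with one load per iteration plus two scalar moves."""
    out = []
    if len(a) < 3:
        return out, 0
    t0, t1 = a[0], a[1]   # values carried into the first iteration
    loads = 2
    for i in range(2, len(a)):
        t2 = a[i]         # the only memory load in the loop body
        loads += 1
        out.append(t0 + t1 + t2)
        t0 = t1           # MCOPY-style move (free with register rotation)
        t1 = t2           # MCOPY-style move (free with register rotation)
    return out, loads
```

The replacement is only valid when the dependence distances (here 1 and 2) are known exactly, which is why accurate dependence information is a stated requirement.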

The Itanium processor architecture provides instructions that load a pair of
floating-point numbers at a time [3]. Such load-pair instructions take a single memory issue
slot, thus possibly reducing the initiation interval of a software pipelined loop. For
example, the loop in Fig. 4(a) has three memory operations per iteration. By using
load-pair operations, the number of memory references can be reduced to two per iteration
after unrolling (4(b) and 4(c)). The load-pair optimization has to take into account the
following requirements on the Itanium processor.

■ Load-pairs can be issued only at certain alignment boundaries, for example a
16-byte boundary for double precision data elements. We therefore either need to
generate code to peel off aligned portions of loops, or generate multi-version code
for different alignment combinations.

■ The load-pair results have to be loaded into an odd-even register pair. This re- 
quirement can only be enforced during register allocation in the code-generation 
phase of the compiler. We however chose to implement the load-pair optimization 
phase in HLO because high-level information is available to identify adjacent 
memory loads, and because we rely on loop unrolling to expose more load-pair 
opportunities than what would normally be found in user code. 

■ There must be a utility to determine the number of load-pairs that need to be in- 
serted to balance memory operations and computations. 

Affine-condition unswitching hoists conditions out of loops. This has been used in
the compiler community mostly to expose perfect nests. However, we find that it is
quite useful in improving the effectiveness of software pipelining as well. The
initiation interval for loops with conditions tends to be much larger than what it should be
considering that only some of the branches will be taken in any iteration of the loop.

We use affine conditions to partition the loops into many loops, where in each new
loop there is code corresponding to one of the paths. As a result the initiation intervals
of the new loops will only correspond to the instructions that are always executed. A
key design consideration was to avoid code size bloat and unswitch only the critical
conditions.
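A minimal sketch of the partitioning, assuming a hypothetical loop whose body branches on the affine condition i < k: the iteration space is split at k so that each new loop contains only one of the two paths. The function names and loop bodies are illustrative.

```python
def with_condition(b, c, k):
    """Original loop: every iteration evaluates the affine condition i < k."""
    n = len(b)
    a = [0] * n
    for i in range(n):
        if i < k:
            a[i] = b[i] + 1
        else:
            a[i] = c[i] * 2
    return a


def unswitched(b, c, k):
    """Iteration space split at k: each loop body is branch-free."""
    n = len(b)
    a = [0] * n
    split = min(max(k, 0), n)      # clamp so both ranges stay inside [0, n)
    for i in range(0, split):      # iterations where i < k holds
        a[i] = b[i] + 1
    for i in range(split, n):      # iterations where i < k fails
        a[i] = c[i] * 2
    return a
```

Each resulting loop contains only the instructions of its own path, which is what keeps its initiation interval small; the price is one extra copy of the loop per unswitched condition.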

Phase-ordering constraints: Scalar replacement of memory references should be one
of the last few optimizations in HLO since transformations such as loop interchange
and unrolling expose new opportunities for scalar replacement of memory references.
Array contraction is also performed as part of scalar replacement since they use
similar logic. Since scalar replacement and array contraction need accurate dependence
analysis, the dependence graphs must be rebuilt before entering scalar replacement.

The interaction between load-pair insertion and loop unrolling influenced the de- 
sign of load-pair insertion technique and the phase-ordering between unroll-jam and 
load-pair insertion. 

■ Position insertion of load-pairs after unroll-jam, because the latter exposes
opportunities for inserting load-pair instructions.

■ However, load-pair insertion cannot call unroll-jam on demand because that
would also require other optimizations such as scalar replacement, that eliminate
redundant loads, to be done after unroll-jam.

These two constraints influenced us to divide load-pair insertion into two stages. 
The first stage of load-pair optimization is therefore executed immediately before loop 
unrolling, so that the loop is analyzed to estimate the number of load-pairs that may be 
generated if the loop were to be unrolled by a factor of 2. Loop unrolling will then 
factor the result into its resource usage model to determine the unroll factor. 

The second stage is where the load-pairs are actually identified. Since it is run after
unrolling, we are able to identify load-pairs that are either derived from the user’s original
code or exposed by loop unrolling. Because we want to prevent load-pairs from being
applied to redundant loads, this stage is placed after scalar replacement. Scalar
replacement is also run after loop unrolling because the latter may expose redundant
loads to be eliminated.

Affine-condition unswitching has to be run after linear loop transformations, be- 
cause that is when we know that the innermost loop will be exposed to software pipe- 
lining. It is advantageous to perform affine-condition unswitching immediately after 
linear transformations. We made an engineering decision to move it later in phase- 
order and position it after load-pair detection so as to minimize updates to dependence 
graph. 



do j=1,1000
   y(j) = y(j) + a*x(j)
enddo

do j=1,1000,2

Fig. 4. An example of the use of load pairs
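The memory-slot arithmetic behind the load-pair example can be sketched as follows. This is an illustration written for this text, not a reconstruction of the figure: a two-element slice stands in for one load-pair instruction, each load, load-pair, or store is counted as one memory issue slot, and alignment requirements are ignored.

```python
def axpy(a, x, y):
    """y(j) = y(j) + a*x(j): two loads and one store per iteration."""
    y = list(y)
    slots = 0
    for j in range(len(y)):
        xj = x[j]
        yj = y[j]
        y[j] = yj + a * xj
        slots += 3
    return y, slots


def axpy_load_pair(a, x, y):
    """Unrolled by 2: two load-pairs and two stores per two iterations."""
    y = list(y)
    slots = 0
    j = 0
    while j + 1 < len(y):
        x0, x1 = x[j:j + 2]       # one load-pair slot
        y0, y1 = y[j:j + 2]       # one load-pair slot
        y[j] = y0 + a * x0        # store
        y[j + 1] = y1 + a * x1    # store
        slots += 4
        j += 2
    if j < len(y):                # odd remainder handled without pairing
        y[j] = y[j] + a * x[j]
        slots += 3
    return y, slots
```

For an even trip count the unrolled version uses four memory slots per two original iterations, that is, two per iteration instead of three.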



2.4 Data Prefetching 

Data prefetching is an effective technique to hide memory access latency. Prefetch
instructions (named lfetch in the Itanium processor architecture) have one argument: the
address to be prefetched. The effect of the instruction is to move the cache line
containing the address to a higher level of the memory hierarchy. The address itself has
no cache alignment requirement.

In the example in Fig. 5, the compiler inserts prefetches for arrays a and b making
use of the support for rotating registers in the Itanium processor architecture to
minimize the prefetch overheads. In this example, incr is a function of the cache line size,
prefetch frequency, and the number of arrays that need to be prefetched within the
loop. The addresses of the two arrays a and b that require prefetching are initialized
before the loop (r33 and r34). The design considerations for a prefetching algorithm
on the Itanium processor are:

• Use data-locality analysis to selectively prefetch only those data references that 
are likely to suffer cache misses. References with spatial locality are selectively 
prefetched under a conditional of the form (i mod L) == 0, where i is the loop in- 
dex and L denotes the cache line size. When multiple references access the same 
cache line, then only the leading reference needs to be prefetched. 

• The cost incurred while prefetching data arises from the added overhead of
executing prefetch instructions as well as instructions required for prefetch address
calculation and predicate computation. The prefetch instructions will occupy
memory slots, thereby increasing resource usage. Compute-intensive applications
normally have sufficient free memory slots. Benefits from prefetching have to be
weighed against the increase in resource usage in memory-intensive applications.

• The predication support in Itanium processor architecture provides an efficient 
way of adding prefetch instructions. The conditionals within the loop are con- 
verted to predicates through if-conversion, thus changing control dependency into 
data dependency. Indirect array references are prefetched making use of specula- 
tion support to load the index array speculatively. 

• When multiple array references with spatial locality are accessed uniformly
within a loop, prefetches can be issued with a single lfetch instruction that uses a
rotating register to rotate the addresses of the different arrays that must be
prefetched [4]. An example of this technique is illustrated in Fig. 5. This technique
obviates the need for predicate calculations within the loop and saves memory
slots that would otherwise be occupied by multiple lfetch instructions.

• Prefetch distance is estimated based on the memory latency, the resource re- 
quirements in the loop, and data dependence information. 

The large number of registers available in the Itanium processor architecture en- 
ables prefetch addresses to be stored in registers obviating the need for register spill 
and fill within loops. 
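Two of these considerations can be sketched numerically. The distance formula below (memory latency divided by the cycles per iteration, rounded up) is a common rule of thumb and an assumption of this sketch; the text only states which inputs the estimate depends on. The line size is expressed in array elements, and all numbers are hypothetical.

```python
import math


def prefetch_distance(memory_latency_cycles, cycles_per_iteration):
    """Iterations to prefetch ahead so the data arrives before it is used."""
    return math.ceil(memory_latency_cycles / cycles_per_iteration)


def prefetch_schedule(n, elems_per_line, distance):
    """Return (iteration, prefetched index) pairs for a unit-stride reference.

    A prefetch is issued only when (i mod L) == 0, i.e. once per cache line,
    and only if the prefetched element still lies inside the array.
    """
    return [
        (i, i + distance)
        for i in range(n)
        if i % elems_per_line == 0 and i + distance < n
    ]
```

With a hypothetical 200-cycle latency and a 4-cycle loop body, `prefetch_distance(200, 4)` gives 50 iterations; `prefetch_schedule` then shows that a unit-stride reference needs only one prefetch per cache line rather than one per iteration.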

Phase-ordering constraints: Prefetching is run after most of the other optimizations
within HLO. This is because prefetching can benefit from many of these other
optimizations. The loop unroller unrolls all inner loops with small trip counts to expose any
outer loops that may have a larger trip count. This makes the prefetches more
effective. Fusion may reduce the total number of prefetches issued if the loops that are
fused access the same data. Performing scalar replacement before prefetching ensures
that many memory references with group locality are replaced by temporary
variables, thus reducing the compile time for prefetching.

As a whole, prefetching also interacts a lot with other optimizations outside HLO. 
For example, strength reduction is run after HLO, making sure that the addresses that 
are inserted by prefetching are strength-reduced. Also, there is a handshake between 
prefetch and the software-pipeliner that is part of the code-generator. As part of HLO, 
the compiler estimates the likelihood of a loop being pipelined. If a loop is predicted 
to be software-pipelined, an estimate of the initiation interval of the loop based on
resource requirements is computed in advance. This estimate aids in the distance
calculation for the prefetches in the loop. Also, when prefetch relies on register rotation,
the address copies are specially marked (shown as MCOPY in Fig. 5) for the software
pipeliner. These special copies are turned into automatic copies using register rotation
by the pipeliner.
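The text does not give the formula used for the distance calculation. As a rough illustration only, a common textbook heuristic divides the memory latency by the estimated initiation interval and rounds up; the function name and the numbers below are hypothetical and are not taken from the compiler described here.

```python
def prefetch_distance_iterations(memory_latency_cycles, initiation_interval_cycles):
    """Iterations to prefetch ahead so that data arrives before it is used.

    Assumption: distance = ceil(latency / II). This is a generic heuristic,
    not the exact computation performed by HLO.
    """
    if initiation_interval_cycles <= 0:
        raise ValueError("initiation interval must be positive")
    # Integer ceiling division avoids floating-point rounding.
    return -(-memory_latency_cycles // initiation_interval_cycles)


if __name__ == "__main__":
    # Hypothetical values: 200-cycle memory latency, loop pipelined at II = 6.
    print(prefetch_distance_iterations(200, 6))
```

Under this heuristic, a smaller initiation interval (a tighter pipelined loop) calls for a larger prefetch distance, which is why an early estimate of the interval is useful.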



(a) Original loop:

    for (i=1; i<n; i++)
        a(i) = b(i-1) + b(i+1)

(b) Prefetch with rotation shown as explicit assignments:

    r33 = 80 + r16
    r34 = 80 + r18
    for (i=1; i<n; i++)
    {
        a(i) = b(i-1) + b(i+1)
        r32 = r34 + incr
        lfetch.nt1 [r34]
        r34 = r33    //MCOPY
        r33 = r32    //MCOPY
    }

(c) Assembly code with register rotation:

          add r33 = 80, r16
          add r34 = 80, r18
    Loop:
          (p16) ldfd f32 = [r8], 8
          (p16) ldfd f37 = [r3], 8
          (p24) stfd [r2] = f46, 8
          (p20) fma f42 = f36, f1, f41
          (p16) add r32 = 16, r34
          (p16) lfetch.nt1 [r34]
          br.ctop Loop

Fig. 5. Prefetch example illustrating the use of rotating registers: (a) original loop, (b) prefetch
for a and b using a single lfetch instruction with rotation shown as explicit assignments, and (c)
assembly code on Itanium™ processor with register rotation
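The rotation in Fig. 5(b) can be traced with a small simulation. This is a sketch under stated assumptions: the base addresses, the mapping of r16 and r18 to the two arrays, and the function name are hypothetical; only the three assignments inside the loop follow the figure.

```python
def rotating_prefetch_addresses(base_a, base_b, iterations, distance=80, incr=16):
    """Return the addresses touched by the single lfetch in Fig. 5(b).

    r33 and r34 start `distance` bytes ahead of the two arrays; each
    iteration prefetches r34, then rotates the copies (the MCOPY lines).
    """
    r33 = distance + base_a  # assumed: r16 holds the address of one array
    r34 = distance + base_b  # assumed: r18 holds the address of the other
    prefetched = []
    for _ in range(iterations):
        r32 = r34 + incr          # next address for the stream just prefetched
        prefetched.append(r34)    # lfetch.nt1 [r34]
        r34 = r33                 # MCOPY: the other stream goes next
        r33 = r32                 # MCOPY: this stream returns in two iterations
    return prefetched


if __name__ == "__main__":
    print(rotating_prefetch_addresses(1000, 2000, 4))
```

The trace alternates between the two address streams, so one prefetch instruction per iteration covers both arrays, each advancing by `incr` bytes every second iteration.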



3 Evaluation of Design Decisions 

Certain compiler optimizations are independent in nature, in that their effect is inde- 
pendent of other optimizations. However, many compiler optimizations are highly in- 
ter-dependent. The interaction between the optimizations tends to be complex. An in- 
teraction could be positive in that an optimization enables several other optimizations 
or improves the effectiveness of other optimizations. An interaction could also be 
negative in that an optimization may disable, reduce, or mask the effectiveness of 
other optimizations. A chosen phase-ordering is most effective when all interactions 
are positive. In this section, we present experimental data for the interaction between 
optimizations in our HLO to show the effectiveness of our chosen phase ordering. 



3.1 Experimental Framework 

The data presented here are based on the performance of our compiler on the 
SPECfp2000 benchmark suite. We used version 6.0 Beta of the Intel Compiler for Mi- 
crosoft Windows 2000/XP on an 800MHz Itanium processor based system with 4MB 
L3 cache. All the experiments done here are intended to show the interaction between 
important optimizations in HLO. Later, we also present the interaction of some other
important compiler modules with HLO. 

We use the following notations in the discussion here: 

• OPT: Set of all HLO optimizations under consideration. 

• opt: one individual optimization in OPT. 

• P(X): Represents the performance with the optimizations in X turned on. 

In order to show the interaction among the optimizations in OPT, we measured the 
performance of all the benchmarks in SPECfp2000 with the following configurations: 

• P(BOTTOM): Baseline performance where just the optimizations in OPT are dis- 
abled. BOTTOM represents all optimizations except HLO optimizations in OPT. 

• P(TOP): Performance with all optimizations in BOTTOM and OPT turned on. This 
corresponds to the base compiler options for the reported SPECfp2000 performance 
numbers. 

• P(BOTTOM+opt): Performance of BOTTOM with optimization opt turned on. 
This data is collected for each optimization in OPT. 

• P(TOP-opt): Performance of TOP with optimization opt turned off. This data is
collected for each optimization in OPT.

The intuition behind collecting the above data is to find the effect of each optimiza- 
tion when applied along with other optimizations as opposed to when applied on its 
own. This shows how an optimization performs in the absence and presence of other 
optimizations and thereby provides insight into interaction of this optimization with 
the other optimizations. We can make the following observations based on the above 
data: 

1. P(BOTTOM+opt) - P(BOTTOM), say gain_at_bottom, gives the performance im- 
provement or degradation when opt is the only HLO optimization turned on. 

2. P(TOP) - P(TOP-opt), say gain_at_top, gives the performance improvement or 
degradation of opt when applied along with other optimizations. 

Clearly, when the above quantities are the same for an optimization opt, then opt does
not interact with any other optimization in OPT. In other words, if opt improved (de- 
graded) performance when applied on its own, then it would continue to improve (de- 
grade) by the same extent when applied along with remaining optimizations in OPT. 
When gain_at_top is very large compared to gain_at_bottom, then most of the benefit 
from applying opt is the result of its positive interaction with other optimizations. This 
could happen, for instance, when another optimization in OPT that is applied earlier 
enables opt or improves its effectiveness. We can also get such a scenario if opt en- 
abled a later optimization in OPT. This implies that a favorable phase-ordering was 
chosen. 

Similarly, if opt interacts negatively with the remaining optimizations in OPT, 
gain_at_top is less than gain_at_bottom. This usually suggests room for improvement 
either as tuning of an optimization or change in phase ordering. It may also be the case 
that two optimizations in OPT target the same performance issue, and the benefits ob- 
tained from the two optimizations are not additive in nature. 

The graphs presented in this section show this interaction. We explain this with
respect to the graph shown in Fig. 6. We normalize all the data with respect to
P(TOP)-P(BOTTOM). The actual speedup of P(TOP) over P(BOTTOM) was shown in Fig. 1
and Fig. 2. For a given optimization in OPT, we provide cumulative bar graphs for
each benchmark performance. As shown in the legend, for a given benchmark we
show the following sections on a bar in this order:

1. P(BOTTOM+opt) - P(BOTTOM): Performance gain (loss) from applying only op- 
timization opt from the set of HLO optimizations in OPT. 

2. P(TOP-opt) - P(BOTTOM): Performance gain (loss) from applying all optimizations
in OPT except opt.

3. This is the additional gain or loss to reach P(TOP)-P(BOTTOM). This is due to the 
interaction of opt with other optimizations in OPT. Positive and negative interac- 
tions are shaded differently. 

Note that if gain_at_bottom is equal to gain_at_top, i.e. when there is no interaction,
we can easily deduce that the first two sections in a bar should add up to
P(TOP)-P(BOTTOM), which has the value one in our graphs. Similarly, if gain_at_top is more
than gain_at_bottom, i.e. when there is positive interaction, then
P(TOP)-P(BOTTOM) is more than the sum of the first two bar sections. Conversely, it is
less than the sum of the first two sections if the interaction is negative.
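The three bar sections can be computed directly from the four measurements. The following is a minimal sketch; the function name, the dictionary keys, and the sample numbers are hypothetical, while the arithmetic follows the definitions of gain_at_bottom and gain_at_top given above.

```python
def interaction_sections(p_bottom, p_top, p_bottom_plus_opt, p_top_minus_opt):
    """Split P(TOP)-P(BOTTOM), normalized to one, into the three bar sections."""
    span = p_top - p_bottom
    if span == 0:
        raise ValueError("P(TOP) and P(BOTTOM) must differ to normalize")
    gain_at_bottom = p_bottom_plus_opt - p_bottom   # opt as the only HLO optimization
    gain_at_top = p_top - p_top_minus_opt           # opt on top of all the others
    other_opts = p_top_minus_opt - p_bottom         # everything in OPT except opt
    return {
        "opt_alone": gain_at_bottom / span,
        "other_opts": other_opts / span,
        # Positive when opt helps more in the presence of the other optimizations.
        "interaction": (gain_at_top - gain_at_bottom) / span,
    }


if __name__ == "__main__":
    # Hypothetical performance scores for one benchmark.
    print(interaction_sections(p_bottom=100, p_top=150,
                               p_bottom_plus_opt=110, p_top_minus_opt=125))
```

By construction the three sections always sum to one, and the interaction section is zero exactly when gain_at_top equals gain_at_bottom.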




[Legend: LLT; other opts; positive interaction; negative interaction]



Fig. 6. Interaction graph for linear loop transformations 



3.2 Analysis of Interaction Data 

The graphs and analysis provided here show that HLO optimizations tend to have a 
high degree of interaction. They also validate our design consideration and phase- 
ordering and provide some useful insights to further opportunities. (The legend shown 
in Fig. 6 applies to all the graphs for all optimizations discussed in this section.) 

3.2.1 Linear Loop Transformations 

Fig. 6 shows the interactions for linear loop transformations. These transformations
interact significantly with other optimizations in HLO for the benchmarks
171.swim, 173.applu, 178.galgel, and 301.apsi. In 171.swim, linear loop
transformations interact with data prefetching. Loop interchange exposes array accesses
with spatial locality that make data prefetching more effective. For 173.applu, there is
interaction between linear loop transformations and loop fusion. In this case, two
adjacent loops could be fused only after loop reversal of the second loop. Interactions in
178.galgel are described in detail in Section 4. Note that 301.apsi shows negative
interaction both at bottom (below y=0 in the graph) and top (above y=1 in the graph),
which is a manifestation of a small performance loss due to loop interchange. The
performance loss is due to differences in distances computed for data prefetching for unit
stride and non-unit stride array references.




Fig. 7. Interaction graph for loop fusion 







Fig. 8. Interaction graph for loop distribution 



3.2.2 Loop Fusion 

In order to enable effective loop fusion, transformations like code motion, loop peel- 
ing, loop reversal, and extensive array section analysis are required. Fig. 7 shows the 
interaction for loop fusion. We observe that interaction is high for 171.swim,
173.applu, and 179.art. There are two primary reasons for interactions in 173.applu:
first, loop reversal enables more fusion as explained in the last subsection; second,
loop fusion enables many scalar replacements. Loop fusion enables scalar
replacements and improves the effectiveness of data prefetching in 179.art.

3.2.3 Loop Distribution 

Fig. 8 shows the interaction for loop distribution. 178.galgel has a very high interac- 
tion. As we will explain in Section 4, a combination of loop distribution, interchange, 
and unroll helps to considerably improve the performance of 178.galgel. However, loop
distribution applied alone has no effect on performance. In 188.ammp, loop
distribution helps software pipelining by partitioning loops into pipelineable and
non-pipelineable sections. 179.art has a short run-time and the negative interaction in the
graph is within experimental error.

3.2.4 Scalar Replacement 

Interaction for scalar replacement of memory references (which also includes array 
contraction) is shown in Fig. 9. Scalar replacement of memory references, when
applied alone, improves performance for 172.mgrid and 301.apsi. In 172.mgrid,
173.applu, 178.galgel, and 200.sixtrack, other transformations help scalar replacement 
to be more effective. In 173.applu, a large number of scalar replacements are enabled 
by loop fusion. In 200.sixtrack, loop unrolling enables more opportunities for scalar 
replacement in key loops. 







Fig. 9. Interaction graph for scalar replacement 



3.2.5 ILP Enhancing Techniques 

Interaction of loop unrolling, load-pair insertion and affine-condition unswitching is 
shown in Fig. 10. In 173.applu, loop unrolling enables loop fusion across large re- 
gions. In 183.equake loop unrolling enables loop fusion and prefetching. 200.sixtrack 
is a floating-point intensive code with many opportunities for extracting ILP. In this 
application, loop unrolling enables larger loops to be pipelined. Larger loops also help 
prefetching to be more effective. 

Fig. 10. Interaction graph for loop unrolling and load-pair insertion



3.2.6 Data Prefetching 

Interaction for data prefetching is shown in Fig. 11. It is interesting to note that data 
prefetching on its own benefits nearly all applications in the graph. Data prefetching 
also shows significant positive interaction with other optimizations. For example, the bars
for 183.equake, 171.swim and 173.applu show large improvements in performance due
to positive interactions with other transformations. Loop unrolling, fusion, and
blocking are key helpers. For example, unrolling of loops exposes larger inner loops to
data prefetching. Note that the gains from prefetching for 173.applu are present only in
the presence of other optimizations. The bar for 189.lucas is dominated by the gain from
prefetch alone and does not show interactions with other HLO optimizations.
200.sixtrack does not benefit from prefetching, and there is a small degradation in per- 
formance at the bottom and at the top. This is because the data accessed fits in the 
cache, and prefetching only adds to the resource requirements without any noticeable 
benefit. The geomean for benefits from prefetching alone is close to 20%, and this in- 
creases to about 40% at the top (as shown in Fig. 2) due to significant positive interac- 
tions. 




Fig. 11. Interaction graph for data prefetching 

Fig. 12. Influence of IPO and Profile feedback (PF) on HLO. (All=HLO+IPO+PF)



4 Key Learning 

The design decisions discussed in previous sections were all related to the fact that we 
were targeting HLO for Itanium processor family. However, some of our other deci- 
sions were made because we implemented HLO in a production compiler. In a pro- 
duction compiler, minimizing compile time, memory usage, and maintenance costs is 
very important. In fact, while designing a production compiler, designers tend to 
forego an opportunity to improve performance in order to improve compile time, 
memory usage or maintenance efforts. For example, positioning affine-condition 
unswitching early in phase-order can enable more transformations. However, 
unswitching causes an update on the dependence graph that is expensive. We chose to 
position unswitching later in the phase-order, because we viewed the compile time in- 
crease a higher penalty than potential gains of moving it earlier in the phase-order. In 
contrast, we chose to rebuild rather than incrementally update the dependence graph after
certain sequences of transformations. This decision slightly increased the compile time.
However, we estimated that the increase in compile time was better than the engi- 
neering cost of maintaining an incremental dependence update mechanism. 

Early in the design and implementation phase of HLO, we decided that the trans- 
formations needed only the high-level resource estimates - such as number of basic 
blocks. However, we learned that the transformations can be more effective with low- 
level resource estimates. We redesigned resource estimation to include an estimation 
of initiation interval of loops, number of registers, and whether a loop is likely to be 
software pipelined. This redesign proved to be key to many transformations including 
loop fusion, distribution, unrolling and insertion of load-pair instructions. 

Traditional compilers tend to do a maximal loop distribution followed by a loop fu- 
sion. We learnt that for certain engineering applications this can result in sub-optimal 
performance. We had to overlay the loop distribution heuristics to control distributions 
by distributing only at heuristically determined points. 

Load-pair instructions proved to be very important for certain engineering applica- 
tions. However, for certain other applications, load-pair instructions did not yield sig- 
nificant improvements as we expected. We learnt that for these applications saving 
memory resources did not matter as much because the memory operations were incurring
a higher latency than what was assumed at the time of scheduling.

We believe that HLO communicates more information to the code generator com- 
pared to other high-level optimizers in the industry. This helped us tightly integrate the 
two components for higher overall performance. We made several decisions that 
helped communicate only the information that would be needed to minimize memory 
usage and compile time. For example, dependence information that can be easily de- 
duced from a symbolic memory disambiguator was not explicitly communicated via a 
dependence graph. 

In this paper, we did not discuss the impact of analysis beyond HLO. Effectiveness 
of optimizations in HLO is enhanced by inter-procedural optimization (IPO) and pro- 
file feedback [4], that are included in the SPEC base options. Their interactions with 
HLO are shown in Fig. 12. 

5 Concluding Remarks 

In this paper, we described design decisions made while designing and implementing 
HLO targeting Itanium processor family. We presented experimental results that vali- 
date the design decisions. The results showed a well designed high-level optimizer can 
have significant impact on overall performance. Such a design must consider at least 
the architecture-driven design considerations discussed in this paper. From the evalua- 
tion of the design choices we made, we can conclude that implementing the entire rep- 
ertoire of transformations is disproportionately more effective than a subset. Since 
HLO was implemented in a production compiler, we made certain engineering trade- 
offs. We discussed these trade-offs and outlined key learning derived from our experi- 
ence. 

Abstract. Cache memories were invented to decouple fast processors
from slow memories. However, this decoupling is only partial, and many
researchers have attempted to improve cache use by program optimization.
Potential benefits are significant since both energy dissipation and
performance highly depend on the traffic between memory levels. But
modeling the traffic is difficult; this observation has led to the use of
heuristic methods for steering program transformations. In this paper,
we propose another approach: we simplify the cache model and we organize
the target program in such a way that an asymptotic evaluation
of the memory traffic is possible. This information is used by our
optimization algorithm in order to find the best reordering of the program
operations, at least in an asymptotic sense. Our method optimizes both
temporal and spatial locality. It can be applied to any static control
program with arbitrary dependences. The optimizer has been partially
implemented and applied to non-trivial programs. We present experimental
evidence that the number of cache misses is drastically reduced
with corresponding performance improvements.



1 Introduction 

Technological advances in the realization of integrated chips result in faster clocks 
for processors, and in larger capacity for memory. In consequence, if nothing is 
done, processors will starve because their memory systems cannot supply data at 
the required speed. Memory hierarchies are a good solution to this problem: they 
are cheap and efficient, at least for ordinary programs and situations. Neverthe- 
less, their efficiency decreases dramatically for scientific computing and signal 
processing codes, where large data sets are accessed according to highly regular 
patterns. Next, their temporal behavior is difficult to predict; this forbids their 
use in systems with hard real time constraints. Lastly, moving data from level to 
level uses a lot of power, which renders them unsuitable for embedded systems. 

A lot of work has been devoted to improving the behavior of memory hierarchies.
There are two kinds of approaches for this problem. The first approach
consists in designing highly optimized libraries (LAPACK is a good example [?]) for
the most common linear algebra and signal processing algorithms. This method
often gives the best results, provided the source problem and the target architecture
are within the scope of the available library. The second approach tries to
optimize the source program at compile time. This method is not restricted to
a given set of algorithms and can be adapted, with minor modifications, to any
memory hierarchy architecture. The present work belongs to the latter approach.

Most optimizing compilers try to transform the source program in order to
improve the behavior of the memory hierarchy. The basic principle is to regroup
all accesses to a given memory cell, in order to take maximum advantage of
possible reuses. This is obtained first by applying loop transformations [?]
according to some cost model [13], then by tiling the resulting loop nest [16]
with tiles having a carefully chosen size [4]. Basically, this method applies only
to perfect loop nests in which dependences are non-existent or have a special
form (fully permutable loop nests). Another data-centric approach starts
from a memory cell and tries to build the slice that accesses this cell. Here
again, dependences greatly complicate the transformation process.

As said above, previous methods usually impose severe limitations
on the input program. Our work can be applied to a wide application domain
since we do not lay down any requirement on dependences provided that the
program has static control [5]. This program class includes a large range of problems
which are discussed in depth by Xue [?]. The properties of such programs can be
summarized in this way: (1) control statements are do loops with affine bounds
and if conditionals with affine conditions (in fact control can be more complex,
see [?]); (2) arrays are the only data structures, and their subscripts are affine;
(3) affine bounds, conditions and subscripts depend only on outer loop counters
and structure (or size) parameters.

All methods mentioned earlier are based on a heuristic cost model. Let us
consider for instance two accesses to the same memory cell. It seems probable that
the longer the time interval between these accesses is, the higher the probability
that the first reference is evicted from the cache. Hence, loop transformations
aim at moving these references to neighboring iterations of some innermost loop.
Our technique is based on an estimate of the memory traffic, and tries to find
the loop transformation that minimizes this estimate, under the constraint that
all dependences are satisfied. This technique, which we call chunking, is presented
in section 2. Section 3 explains how to construct good chunking functions for a
given program. Section 4 deals with the problem of code generation when the
chunking functions are given. Section 5 describes our implementation and
experimental results. Section 6 compares chunking to other approaches. We then
conclude and discuss future work.

2 Chunking 

The principle of our method is to partition the set of operations of a program into
subsets small enough that their accessed data fit in the cache: the chunks. The
program is then executed chunk by chunk, as if there were a cache flush between
each of them. These subsets must be such that their sequential execution is
equivalent to the execution of the original program. In practice, chunks will
be numbered then executed in order of increasing numbers. A chunk number
will be assigned to each operation, i.e. to each instance of each statement. In
other words, for each statement $S$ we seek a chunking function $\theta_S$ associating
a chunk number $\theta_S(x)$ to each iteration vector $x$. The original operations will
be rescheduled according to these chunking functions. We present in figure 1
an example of chunking of a simple program. We assume as input hypothesis
that n array elements can fit in the cache, but m cannot. Even such a simple code
exhibits several difficulties: non-perfect loop nest, dependences between different
statements, parameters and multiple references. In this example, the order of the



    do i=1, n
      a(i) = i                     ! S1
      do j=1, m
        b(j) = b(j) + a(i)         ! S2
      enddo
    enddo

(a) source program

    $\theta_{S1}([i]) = [i]$ ;  $\theta_{S2}([i\ j]^T) = [j + n]$

(b) chunking functions

    do c=1, n
      a(c) = c                     ! S1
    enddo
    do c=n+1, n+m
      do i=1, n
        b(c-n) = b(c-n) + a(i)     ! S2
      enddo
    enddo

(c) target program



Fig. 1. Running example 



operations has been modified for a maximal use of temporal locality, according to
the chunking functions in figure 1(b). In the target program, c gives the number
of the current chunk. This example will be used for illustration throughout this 
paper. It can be noticed that the code can be restructured in the same way by 
conventional loop distribution, loop permutation and skewing. Chunking is set 
in the framework of the polytope model and every chunking can be broken down 
in a succession of well known transformations. In fact, chunking does not aim to 
find new transformations but to find the right transformation automatically. 
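As a quick sanity check of the running example, the source and target programs of Fig. 1 can be transcribed and compared. This is only a sketch: the Python function names and the 1-based list layout are choices made here, and integer data is used so that reordering the additions cannot change the result.

```python
def run_source(n, m, b_init):
    """Fig. 1(a): original loop nest, arrays indexed from 1."""
    a = [0] * (n + 1)
    b = [0] + list(b_init)
    for i in range(1, n + 1):
        a[i] = i                      # S1
        for j in range(1, m + 1):
            b[j] = b[j] + a[i]        # S2
    return a, b


def run_target(n, m, b_init):
    """Fig. 1(c): same operations executed chunk by chunk."""
    a = [0] * (n + 1)
    b = [0] + list(b_init)
    for c in range(1, n + 1):         # chunks 1..n hold the instances of S1
        a[c] = c
    for c in range(n + 1, n + m + 1):  # chunk j+n holds all S2 instances for b(j)
        for i in range(1, n + 1):
            b[c - n] = b[c - n] + a[i]
    return a, b


if __name__ == "__main__":
    n, m = 4, 6
    print(run_source(n, m, range(m)) == run_target(n, m, range(m)))
```

In the target program each chunk of the second loop touches all of `a` and a single element of `b`, which is consistent with the hypothesis that n array elements fit in the cache.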

3 Computing Chunking Functions

The quality of a chunking can be assessed by using two valuations. First, the 
footprint size which is the number of memory cells accessed by the operations of 
a chunk. Next, the traffic which is the number of data movements between main 
and cache memories. We want to build an optimal chunk system i.e. where each 
chunk footprint fits in the cache and the traffic is minimal. To be able to generate 
the target code, we are looking for affine chunking functions. Consequently, for
an operation $S[x]$, instance of the statement $S$ with the iteration vector $x$ in the
iteration domain $D_S$, the chunk number can be written:

$$\theta_S(x) = T_S x + k_S.$$

$T_S$ is a matrix called the chunking matrix; its dimensions are $g \times p(S)$ with $p(S)$
the number of loops surrounding $S$. The choice of the value of $g$ is postponed till
section 3.2. $k_S$ is a constant vector. Chunking functions are calculated in several
steps which are discussed in the next sections. In section 3.1 we show how to
compute an asymptotic evaluation of the traffic with respect to the chunking
functions. Then we exhibit the constraints that the chunking functions must
satisfy to minimize the traffic. Section 3.2 explains how to find all the functions
verifying such constraints. Section 3.3 shows how to choose the functions in such
a way that the transformation is legal for dependences. Lastly, sections 3.4 and
3.5 give respectively the constraints which have to be satisfied by the chunking
functions in order to achieve group-locality and spatial-locality.



3.1 Asymptotic Evaluation 

It is hard to find an accurate solution to the traffic evaluation problem for a 
particular cache type. Modeling the replacement mechanism is quite difficult, 
but it is bypassed by chunking. However, several difficulties remain, hence we 
propose the following simplifications on our cache and memory models: 

— conflict misses do not change the order of magnitude of the traffic; this as- 
sumption is satisfied by fully associative caches and is close to be satisfied 
by modern caches with high associativity; most discrepancies can be com- 
pensated by using an effective cache size smaller than the real one; 

— we will be satisfied with asymptotic evaluation of the traffic; in many cases, 
program transformations can change the order of magnitude of the traffic, 
then it would be useless to fiddle with constant factors or worse, units in 
the last decimal place; in some cases, i.e. when self-reuse has already been 
exploited, only the constant factors can be improved; the question of deciding 
if a more precise evaluation can influence the target code is left for future 
work. 

In our model, it is possible to make estimates of footprint sizes and traffic
with respect to the chunking functions. Considering a statement $S$, an array $A$
and a subscript function $f$, the footprint generated by this reference is the set
of memory cells accessed during the chunk execution:

$$\mathcal{F}_{S,A,f}(t) = \{ f(x) \mid x \in D_S,\ \theta_S(x) = t \}. \qquad (1)$$

Let us suppose that the cache is empty at the start of a chunk and that its
footprint fits in the cache. Then any cell in the footprint is copied once to the
cache at some time during the execution of the chunk and stays there until the
termination of the chunk. Hence the traffic can be estimated as the number of
pairs (data, chunk number):

$$\mathcal{T}_{S,A,f} = \operatorname{Card}\, \{ (f(x), \theta_S(x)) \mid x \in D_S \}. \qquad (2)$$
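Definitions (1) and (2) can be evaluated by brute force on small instances. The sketch below does this for statement S2 of the running example with the chunking function $\theta_{S2}(i, j) = j + n$; the function names and the sizes n = 3, m = 5 are hypothetical choices made for illustration.

```python
def footprint(domain, subscript, chunk, t):
    """Definition (1): cells accessed by the operations of chunk t."""
    return {subscript(x) for x in domain if chunk(x) == t}


def traffic(domain, subscript, chunk):
    """Definition (2): number of distinct (cell, chunk number) pairs."""
    return len({(subscript(x), chunk(x)) for x in domain})


if __name__ == "__main__":
    n, m = 3, 5
    # Iteration domain of S2: 1 <= i <= n, 1 <= j <= m.
    domain_s2 = [(i, j) for i in range(1, n + 1) for j in range(1, m + 1)]
    theta_s2 = lambda x: x[1] + n        # chunk number j + n, as in Fig. 1(b)
    ref_a = lambda x: x[0]               # reference a(i)
    ref_b = lambda x: x[1]               # reference b(j)

    print(len(footprint(domain_s2, ref_a, theta_s2, n + 1)),
          traffic(domain_s2, ref_a, theta_s2))
    print(len(footprint(domain_s2, ref_b, theta_s2, n + 1)),
          traffic(domain_s2, ref_b, theta_s2))
```

For a(i) every chunk reloads the n elements of a, giving a footprint of n cells and a traffic of n·m pairs, while for b(j) each chunk touches a single cell and the traffic is m.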

Since input programs have static control, subscript functions are affine and
can be written $f(x) = Fx + a$, where $F$ is the subscript matrix of dimension
$p(A) \times p(S)$, with $p(A)$ the dimension of array $A$, and $a$ a constant vector.

Theorem 1. Let $H = \{ Ux \mid Vx = 0,\ x \in D \}$ be a set where $U$ and $V$ are
arbitrary integral matrices of the right dimension, and where $D$ is a bounded full
dimensional domain such that the value of each component of the vector $x$ is
an integer in a segment of length $m$. Then Card $H$ is of the order of $m^l$ with

$$l = \operatorname{rank} \begin{pmatrix} U \\ V \end{pmatrix} - \operatorname{rank} V.$$

Proof. Let us first study the dimension of the subspace $K = \{ Ux \mid Vx = 0 \}$.
This corresponds to the rank of the application $f$ from $\ker V$ to $\operatorname{Im} U$ that
associates $Ux$ to $x$. According to a well known algebraic theorem, we have
$\dim \ker V = \operatorname{rank} f + \dim \ker f$. As $\ker f = \ker U \cap \ker V$, it follows:

$$\operatorname{rank} f = \dim \ker V - \dim (\ker U \cap \ker V).$$

Since $D$ is such that the value of each component of $x$ is an integer in a segment
of length $m$, it follows that each component of $Ux$ also is integral and belongs
to a segment of length proportional to $m$. Hence, the size of $H$ is of the order of
$m^{\operatorname{rank} f}$. Since $\dim \ker V + \operatorname{rank} V$ is the number of columns of $V$, and likewise
for the stacked matrix, whose kernel is $\ker U \cap \ker V$, we finally have that Card $H$
is of the order of $m^l$ with $l = \operatorname{rank} \begin{pmatrix} U \\ V \end{pmatrix} - \operatorname{rank} V$. ■

The orders of magnitude of the cardinals of the sets describing footprints (1) and
traffic (2) are directly given by Theorem 1. The asymptotic size of footprints is
found with $V$ as $T$ and $U$ as $F$, and the traffic with $V$ as the null
matrix and $U$ as the block matrix $\begin{pmatrix} T \\ F \end{pmatrix}$, composed of the matrix $T$ for its first
rows and of the matrix $F$ for the next rows. If the value of each component of $x$
is an integer in a segment of length $m$, we have:

$$\operatorname{Card} \mathcal{F}_{S,A,f}(t) = O(m^l), \text{ with } l = \operatorname{rank} \begin{pmatrix} T \\ F \end{pmatrix} - \operatorname{rank} T,$$

$$\mathcal{T}_{S,A,f} = O(m^k), \text{ with } k = \operatorname{rank} \begin{pmatrix} T \\ F \end{pmatrix}.$$
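The two exponents can be computed mechanically from the chunking matrix and the subscript matrix. The following sketch uses exact rational Gaussian elimination; the helper names are hypothetical, and the matrices are those of statement S2 in the running example with iteration vector (i, j) and chunking matrix T = [0 1].

```python
from fractions import Fraction


def rank(rows):
    """Rank of a matrix given as a list of rows, by exact Gaussian elimination."""
    m = [[Fraction(v) for v in row] for row in rows]
    r = 0
    for c in range(len(m[0]) if m else 0):
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            factor = m[i][c] / m[r][c]
            m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
    return r


def exponents(T, F):
    """Return (l, k): footprint is O(m^l) and traffic is O(m^k)."""
    stacked = rank(T + F)          # rank of T placed above F
    return stacked - rank(T), stacked


if __name__ == "__main__":
    T = [[0, 1]]                   # theta_S2(i, j) = j + n
    print(exponents(T, [[1, 0]]))  # reference a(i)
    print(exponents(T, [[0, 1]]))  # reference b(j)
```

For a(i) this gives l = 1 and k = 2, and for b(j) it gives l = 0 and k = 1, matching the counts n, n·m, 1 and m that a direct enumeration of the example yields.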
These evaluations depend on $F$, which can be extracted by analysis of the source
code, and $T$, which is the unknown of the problem. Thus we can find the
constraints that $T$ has to satisfy in order that the footprints fit in the cache and the
traffic is minimal.

Let us consider one statement with n array accesses, the subscript matrix of [...]
corresponding to the possible sets of constraints can be enumerated. We need to know
the cache size $C$ and an estimate of the size parameter $m$. We then determine
an integer $a$ such that $m^a < C$. A footprint component of size $O(m^l)$ fits in
the cache if $l \le a$. We can thus eliminate all tuples for which this condition is
not satisfied, and we can rank the remaining ones in order of increasing traffic.
It then remains to try building a $T$ which satisfies the rank condition of the best
tuple. If this is proved to be impossible, we start again with the next tuple.

3.2 Building Chunking Matrices

Thanks to the evaluations, we know which rank constraints must be satisfied by
the chunking matrices to minimize the traffic. In this section, we show how to
build such matrices, at first when the corresponding statement includes only one
reference. Then, we show that there always exists a chunking matrix such that
each associated footprint fits in the cache.

For a statement $S$ with one reference, it is always possible to find a matrix $T$
such that $\operatorname{rank} T = v$ and $\operatorname{rank} \begin{pmatrix} T \\ F \end{pmatrix} = w$, provided that $v$ and $w$ have
compatible values (i.e. $p(S) \ge w \ge v$). The building process is described by the
algorithm in figure 2. From the returned matrix $T$, we can generate the set of
matrices with the required properties: the set of matrices $CT$ where $C$ is a
matrix of full row rank. We will choose in this set the matrices in order to satisfy
additional constraints described in sections 3.3 and 3.4.

Let us demonstrate that this algorithm builds a matrix $T$ that answers the
requirements. Since the matrix $T$ is composed of $v$ linearly independent rows, the
constraint $\operatorname{rank} T = v$ is satisfied. These rows are those of $G^{-1}$ from $p(S) - w + 1$
to $p(S) - w + v$. Hence, the kernel of $T$ is generated by the column vectors of
$G$ from $1$ to $p(S) - w$ and from $p(S) - w + v + 1$ to $p(S)$. The kernel of
$\begin{pmatrix} T \\ F \end{pmatrix}$ is the intersection of the kernel of $T$ with the kernel of $F$, hence it is generated
by the $p(S) - w$ first column vectors of $G$ and the constraint $\operatorname{rank} \begin{pmatrix} T \\ F \end{pmatrix} = w$
is satisfied. As for the choice of $g$, the number of rows of $T$, it is clear that
bordering a matrix by null rows does not change its rank. Since when reordering
the program it is useful to have all chunking functions of the same dimension, we
may take $g = \max_S p(S)$.

The generalization to n references implies the combination of n constraints:
$\operatorname{rank} \begin{pmatrix} T \\ F_i \end{pmatrix} = w_i$ for $1 \le i \le n$. The matrix $G$ must have for each reference
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Construction Algorithm: Build a matrix under rank constraints. 



Input: the subscript matrix F and the rank constraints rank T — v and 




Output: a matrix T respecting the rank constraints. 

1. Compute a basis of kerF and complete it to a basis of 

2. Let G be the matrix of these vectors (vectors added to complete to a basis of 

are the last columns). 

3. Compute inverse of G. 

4. Build matrix T: 

a) For i from 1 to 

row of T = (p{S) -wF row of G~\ 

b) Complete T with null rows. 



Fig. 2. Construction Algorithm 



exactly p{S) — Wi vectors of a basis of ker Fi for a total of at most v vectors. 
Such a matrix does not always exist. The choice of vectors to be included in the 
matrix G is essential. We can guide this choice by adding for each reference as 
many vectors from a preceding reference as possible. If a solution does not exist 
for a tuple, then we try to find another one for the next more interesting tuple. 
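The construction algorithm of figure 2, for a statement with a single
reference, can be sketched in Python as follows. This is a minimal illustration
and not necessarily what the prototype does: the function names (rref,
kernel_basis, complete_basis, inverse, build_T) are hypothetical, the kernel
basis is completed greedily with canonical unit vectors (one possible choice
among many), and the optional argument g is the number of rows wanted for T.

```python
from fractions import Fraction


def rref(rows):
    """Reduced row echelon form; returns (matrix, pivot column indices)."""
    m = [[Fraction(x) for x in row] for row in rows]
    pivots, r = [], 0
    for c in range(len(m[0]) if m else 0):
        p = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if p is None:
            continue
        m[r], m[p] = m[p], m[r]
        m[r] = [x / m[r][c] for x in m[r]]
        for i in range(len(m)):
            if i != r and m[i][c] != 0:
                f = m[i][c]
                m[i] = [a - f * b for a, b in zip(m[i], m[r])]
        pivots.append(c)
        r += 1
    return m, pivots


def rank(rows):
    return len(rref(rows)[1])


def kernel_basis(F, p):
    """Step 1a: a basis of ker F, as vectors of length p."""
    m, pivots = rref(F)
    basis = []
    for free in (c for c in range(p) if c not in pivots):
        vec = [Fraction(0)] * p
        vec[free] = Fraction(1)
        for row, pc in zip(m, pivots):
            vec[pc] = -row[free]
        basis.append(vec)
    return basis


def complete_basis(vectors, p):
    """Step 1b: append unit vectors until the family spans the whole space."""
    basis = list(vectors)
    for i in range(p):
        e = [Fraction(int(i == j)) for j in range(p)]
        if rank(basis + [e]) > rank(basis):
            basis.append(e)
    return basis


def inverse(G):
    """Step 3: inverse of a square invertible matrix, via rref of [G | I]."""
    n = len(G)
    aug = [list(G[i]) + [int(i == j) for j in range(n)] for i in range(n)]
    return [row[n:] for row in rref(aug)[0]]


def build_T(F, p, v, w, g=None):
    """Steps 2 and 4: rows p-w+1 .. p-w+v of G^-1, padded with null rows."""
    assert p >= w >= v
    cols = complete_basis(kernel_basis(F, p), p)
    G = [[col[i] for col in cols] for i in range(p)]    # vectors as columns
    G_inv = inverse(G)
    T = [G_inv[p - w + i] for i in range(v)]            # 0-based row indices
    T += [[Fraction(0)] * p] * ((g or v) - v)
    return [[int(x) if x.denominator == 1 else x for x in row] for row in T]


print(build_T([[1]], p=1, v=1, w=1))            # single reference a(i)
print(build_T([[1, 0]], p=2, v=1, w=2, g=2))    # reference with F = [1 0]
```

On these two inputs the sketch returns the matrices [1] and [0 1; 0 0] used in
Example 1. Other completions of the kernel basis would give other valid
matrices with the same ranks.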

A chunking matrix such that each footprint fits in the cache always exists.
The hardest constraint for the footprints is to have a size in $O(m^{0})$, and
the last tried possibility will be the tuple $(p(S), w_i = p(S)$ for
$1 \leq i \leq n)$. The corresponding chunking generates for the reference
footprint sizes of $O(m^{0})$ and the maximal traffic of $O(m^{p(S)})$. Its
solution T = Id always exists and is the trivial chunking where there is one
chunk per operation.
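The selection of tuples described in section 3.1 (enumerate, discard the tuples
whose footprints do not fit, rank the rest by increasing traffic, with the
identity tuple as a last resort) can be sketched as below. Several points are
assumptions of this sketch and not statements of the text: a single size
parameter m, the bounds max(v, rank F_i) <= w_i <= min(p, v + rank F_i), which
are standard linear algebra, the comparison of traffics through their sorted
exponents, and the arbitrary tie-breaking order. The function name
candidate_tuples is hypothetical.

```python
from itertools import product


def candidate_tuples(p, ranks_F, a):
    """Tuples (v, w_1, ..., w_n) whose footprints fit, by increasing traffic.

    p       -- depth of the statement (dimension of its iteration vector)
    ranks_F -- rank of the subscript matrix of each reference
    a       -- largest exponent such that m**a still fits in the cache
    """
    kept = []
    for v in range(p + 1):
        # rank [T; F_i] lies between max(v, rank F_i) and min(p, v + rank F_i)
        ranges = [range(max(v, r), min(p, v + r) + 1) for r in ranks_F]
        for ws in product(*ranges):
            if all(w - v <= a for w in ws):      # footprint O(m^(w - v)) fits
                kept.append((v, *ws))
    # Traffic is a sum of O(m^w_i) terms: compare exponents, largest first.
    kept.sort(key=lambda t: (sorted(t[1:], reverse=True), t))
    return kept


for t in candidate_tuples(p=2, ranks_F=[1, 1], a=1):
    print(t)
```

Each tuple is then handed to the construction step in this order until a
matrix T can be built. The tuple (p, p, ..., p), whose solution is T = Id, is
always among the candidates.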

Example 1. Let us consider the source code in figure 1. We assume that a is an
array of n cells which fits in the cache and b is an array of m cells which does
not fit in the cache. Then, the acceptable orders of magnitude for the footprint
sizes are $O(n^{1})$ and $O(m^{0})$. The program has two statements:

— the statement S1 has just one reference to the array a with the index matrix
  $F_{S1,1} = [1]$; the matrix $T_{S1}$ having the best properties corresponds
  to the tuple (1, 1), it will generate footprint sizes of $O(n^{0})$ and a
  traffic of $O(n^{1})$; the algorithm builds $T_{S1} = [1]$;

— the statement S2 has two references, the first one to the array a with the
  index matrix $F_{S2,1} = [1\ 0]$, the second one to the array b with the
  index matrix $F_{S2,2} = [0\ 1]$; the matrix $T_{S2}$ having the best
  properties would correspond to the tuple (1, 2, 1), it would generate
  footprint sizes of $O(m^{0} + n^{1})$ and a traffic of $O(m^{1} + n^{2})$;
  the construction is possible and gives
  $T_{S2} = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}$.



3.3 Legality 

Since chunking reorders operations, it must satisfy dependences. In this
section, we explain how chunking functions can be chosen in such a way that the
transformation satisfies dependences. We will show that there always exists a 
valid solution which satisfies the constraints described in previous sections. 

Chunks are numbered in the order they will be executed, and inside each of
them, operations are executed in the original sequential order. Let us consider
$\Sigma_P$, the statement set of the program, and $\delta$, the dependence
relation on $\Sigma_P$; a chunking is legal if and only if:

$$\forall S, R \in \Sigma_P, \quad S[x]\ \delta\ R[y] \Rightarrow \theta_S(x) \leq \theta_R(y). \qquad (3)$$

There is no a priori reason for (3) to be satisfied by the chunking matrices
as constructed by the algorithm in the previous section. However, we are free to
modify them as long as we do not change their rank properties. We are also
free to adjust the constant vectors $k_S$, as they have no impact on the
footprints and traffic (at least asymptotically). Thus, for any statement S, the
chunking function can be written

$$\theta_S(x) = C_S T_S x + k_S,$$

where $C_S$ is a matrix of full row rank. We use the Farkas algorithm to solve
(3) and to find the set of all $C_S$ and $k_S$. If the problem has no solution,
we declare a failure and try the next best traffic/footprint combination.

A legal solution such that the footprints fit in the cache always exists. It
corresponds to the worst solution, in which all the chunking matrices are identity
matrices. In this case, the original program is not modified. This possibility must 
always be left open, since it might happen that the source program is already 
optimal. 

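Example 2. Let us continue the example of section 3.2. The chunking functions
associated to the proposed matrices are:

$$\theta_{S1}\left(\begin{bmatrix} i \end{bmatrix}\right) = \begin{bmatrix} 1 \end{bmatrix}\begin{bmatrix} i \end{bmatrix} + \begin{bmatrix} 0 \end{bmatrix} = \begin{bmatrix} i \end{bmatrix}; \quad \theta_{S2}\left(\begin{bmatrix} i \\ j \end{bmatrix}\right) = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}\begin{bmatrix} i \\ j \end{bmatrix} + \begin{bmatrix} 0 \\ 0 \end{bmatrix} = \begin{bmatrix} j \\ 0 \end{bmatrix}.$$

These functions do not describe a valid chunking: the dependence from S1 to S2
is not satisfied. For instance, the operation
$S2\left[\begin{smallmatrix} 2 \\ 1 \end{smallmatrix}\right]$ is executed in
chunk number 1 whereas the operation S1[2] on which it depends is executed
later, in chunk number 2. Our method makes it possible to correct this chunking
so that all the dependences are respected and the quality is preserved. The
correction suggested by our prototype is the following one:

$$\theta_{S1}\left(\begin{bmatrix} i \end{bmatrix}\right) = \begin{bmatrix} 1 \end{bmatrix}\begin{bmatrix} i \end{bmatrix} + \begin{bmatrix} 0 \end{bmatrix} = \begin{bmatrix} i \end{bmatrix}; \quad \theta_{S2}\left(\begin{bmatrix} i \\ j \end{bmatrix}\right) = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}\begin{bmatrix} i \\ j \end{bmatrix} + \begin{bmatrix} n \\ 0 \end{bmatrix} = \begin{bmatrix} j + n \\ 0 \end{bmatrix}.$$

To homogenize the chunking functions, one can add null dimensions, or remove
them if they are null for all the functions, since this does not change the
ranks. We have finally
$\theta_{S1}\left(\begin{bmatrix} i \end{bmatrix}\right) = \begin{bmatrix} i \end{bmatrix}$
and
$\theta_{S2}\left(\begin{bmatrix} i \\ j \end{bmatrix}\right) = \begin{bmatrix} j + n \end{bmatrix}$.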
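The legality condition (3) can be checked by brute force on this example. The
sketch below assumes, since the source code of the running example is not
repeated here, that the only dependences are from S1[i] to S2[i, j] with
1 <= i <= n and 1 <= j <= m; the function name violations is hypothetical.

```python
def violations(theta_s1, theta_s2, n, m):
    """Dependences S1[i] -> S2[i, j] whose source runs in a later chunk."""
    return [(i, j)
            for i in range(1, n + 1)
            for j in range(1, m + 1)
            if theta_s1(i) > theta_s2(i, j)]


n, m = 3, 4
initial = violations(lambda i: i, lambda i, j: j, n, m)
corrected = violations(lambda i: i, lambda i, j: j + n, n, m)
print(initial)     # pairs (i, j) where S2[i, j] would run before S1[i]
print(corrected)   # remaining violations after the correction
```

Under these assumptions the first chunking violates the dependence for every
pair with i > j, including the pair (2, 1) mentioned above, while the shift by
n removes all violations because j + n >= n + 1 > i.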



3.4 Group-Reuse 

There is group-reuse when two statements, S1 and S2, access the same array
A through indexing matrices $F_1$ and $F_2$ (for the sake of readability, we
will use homogeneous coordinates in this section). There is reuse if there exist
iteration vectors $x_1$ and $x_2$ such that $F_2 x_2 = F_1 x_1$, and this reuse
is exploited if these two operations are in the same chunk:

$$\forall x_1\, \forall x_2, \quad F_2 x_2 - F_1 x_1 = 0 \Rightarrow T_2 x_2 - T_1 x_1 = 0. \qquad (4)$$

Observe that this constraint has the same shape as a dependence constraint.
If $F_2 x_2 = F_1 x_1$, then $S1[x_1]$ and $S2[x_2]$ are in dependence. This
dependence may be a read-read dependence, which may not be taken into account in
other circumstances, but which exists nevertheless. As to the right-hand side of
(4), it is similar to but more restrictive than the right-hand side of (3). As a
consequence, we can give a more precise result:

Theorem 2. (4) is true iff $(T_2 - T_1) = N(F_2 - F_1)$ where N is a matrix of
full row rank.

Proof. Let x be the concatenation of vectors $x_1$ and $x_2$. Formula (4) can be
written

$$\forall x, \quad (F_2 - F_1)x = 0 \Rightarrow (T_2 - T_1)x = 0.$$

$(F_2 - F_1)x = 0$ and $(T_2 - T_1)x = 0$ describe two sets where one point
belonging to the first one necessarily belongs to the second one too. Therefore
the first one is a subset of the second one. So it can be written as the second
one with b additional constraints:

$$(F_2 - F_1)x = 0 \iff \begin{bmatrix} T_2 - T_1 \\ Q \end{bmatrix} x = 0,$$

then $\begin{bmatrix} T_2 - T_1 \\ Q \end{bmatrix} = M(F_2 - F_1)$ with M a
matrix such that $\det M \neq 0$ (the system is not modified by linear
transformations). Let us write M as $\begin{bmatrix} N \\ N' \end{bmatrix}$
where N' is the matrix made with the b last lines of M. Now we have
$\begin{bmatrix} T_2 - T_1 \\ Q \end{bmatrix} = \begin{bmatrix} N \\ N' \end{bmatrix}(F_2 - F_1)$
and finally $(T_2 - T_1) = N(F_2 - F_1)$.

The unknowns are the entries of N, which define the linear transformations
to apply to $(F_2 - F_1)$ in such a way that the chunking functions respect the
dependences. This is clearly the same problem as the correction for dependences
in section 3.3. We solve them at the same time, by adding the necessary
constraints (a set of constraints by pairs of references in which group-reuse is
detected) to the initial problem. This theory, which does not assume that
group-reuse is associated to constant dependences, can even be used for
"self-group-reuse", when the two accesses to A are in the same statement. Here,
we deduce from (4) that the linear subspace
$G = \{x_2 - x_1 \mid F_1 x_1 - F_2 x_2 = 0\}$ is included in the kernel of
$T = T_1 = T_2$. It is easy to find a basis for G by Gaussian elimination
techniques. The resulting vectors can be taken into account when building the
chunking matrices. Improving group-locality does not change the order of
magnitude of the traffic. It can divide the traffic generated by n references by
a factor of n.

Example 3. Let us consider the source code in figure 3(a).

    do i=1, n
      do j=5, n-10
        C(i,j) = A(i,j-5)
        D(i,j) = A(j+10,i)
      enddo
    enddo

    (a) sample code

    (b) Accessed zones of A [drawing: zone accessed by S1, zone accessed by S2]

Fig. 3. Example of group reuse

All control centric methods will estimate that there is no self reuse and no
exploitable group-reuse. The reason is that they fail to consider non uniformly
generated references (uniformly generated references are such that their
subscript functions differ in at most the constant term [7]). In fact there is
good reuse between the two statements for a part of the array A, as shown by
figure 3(b). In this example, there is no dependence, so we can use the trivial
solution of $(T_2 - T_1) = N(F_2 - F_1)$, that is $T_1 = F_1$ and $T_2 = F_2$.
Therefore, the chunking functions will be:

$$\theta_{S1}\left(\begin{bmatrix} i \\ j \end{bmatrix}\right) = \begin{bmatrix} i \\ j - 5 \end{bmatrix}; \quad \theta_{S2}\left(\begin{bmatrix} i \\ j \end{bmatrix}\right) = \begin{bmatrix} j + 10 \\ i \end{bmatrix}.$$

This transformation leads to the target code below. The group-locality is now
maximal: in the shared zone of A, the two statements access the same memory
cell during the same iteration.

    do c1=1, 14
      do c2=0, n-15
        C(c1,c2+5) = A(c1,c2)       ! S1
      enddo
    enddo
    do c1=15, n
      C(c1,5) = A(c1,0)             ! S1
      do c2=1, n-15
        C(c1,c2+5) = A(c1,c2)       ! S1
        D(c2,c1-10) = A(c1,c2)      ! S2
      enddo
      do c2=n-14, n
        D(c2,c1-10) = A(c1,c2)      ! S2
      enddo
    enddo
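One way to convince oneself that the target code performs the same operations
as the sample code of figure 3(a) is to enumerate both loop nests and compare
the assignments they execute. The Python sketch below does this for a few
values of n; the function names original and chunked and the tuple encoding
(statement, written cell, read cell of A) are ours.

```python
def original(n):
    """Operations of the sample code: (statement, written cell, read cell)."""
    ops = []
    for i in range(1, n + 1):
        for j in range(5, n - 10 + 1):
            ops.append(("S1", (i, j), (i, j - 5)))     # C(i,j) = A(i,j-5)
            ops.append(("S2", (i, j), (j + 10, i)))    # D(i,j) = A(j+10,i)
    return ops


def chunked(n):
    """Operations of the target code, in the order it executes them."""
    ops = []
    for c1 in range(1, 15):
        for c2 in range(0, n - 15 + 1):
            ops.append(("S1", (c1, c2 + 5), (c1, c2)))
    for c1 in range(15, n + 1):
        ops.append(("S1", (c1, 5), (c1, 0)))
        for c2 in range(1, n - 15 + 1):
            ops.append(("S1", (c1, c2 + 5), (c1, c2)))
            ops.append(("S2", (c2, c1 - 10), (c1, c2)))
        for c2 in range(n - 14, n + 1):
            ops.append(("S2", (c2, c1 - 10), (c1, c2)))
    return ops


for n in (15, 20, 33):
    same = sorted(original(n)) == sorted(chunked(n))
    print(n, same)
```

Since the example has no dependence, comparing the two multisets of operations
is sufficient; with dependences, the relative order of dependent operations
would have to be checked as well.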



3.5 Spatial-Reuse 

There is spatial reuse for a reference if it accesses data on the same cache line 
during different iterations. As for group locality, improving spatial locality does not
change the order of magnitude of the traffic. It can divide the traffic generated 
by a reference by a factor of d, where d is the cache line length in words. Spatial 
locality is achieved if the operations accessing the same cache line are in the same 
chunk. Let us consider a reference to an array A with the subscript function F. 
Let i be the number of the major dimension of A, i.e. the dimension with data 
lines ordered successively in memory. Then spatial locality is achieved for A if 
the operations accessing the memory cells of the major dimension are in the 
same chunk. In other words, spatial locality is achieved if the kernel of F
deprived of its i-th row is included in ker T.

This constraint is added in the T construction algorithm seen in section 3.2
by asking for a more accurate choice of vectors to be included in the matrix G. If 
the new constraint prevents the construction of T, we can try with another line of 
the subscript function and suggest the corresponding data layout transformation. 
This result can be compared with the Kandemir et al. method [8], where both 
loop and data transformations are used to improve spatial locality. Chunking 
does not require a non-singular transformation matrix, but it can achieve spatial 
locality only for a given loop level. However, in practice, results are often alike.

4 Code Generation 

Code generation is the last step to the final program. It is often ignored in spite 
of its impact on the target code quality. We must ensure that a bad control 
management does not spoil performance, for instance by producing redundant 
guards or complex loop bounds. An outline of the resulting code is a loop on the 
number of chunks L which contains the chunk operations. If the chunk numbers 
are vectors, we have as many surrounding loops as chunking dimensions. 

Because the input problem is a static control program, the bounds on state- 
ment iteration spaces can be specified by a set of linear inequalities defining 
a polyhedron m- In the chunking case, we change the scanning order of this 
polyhedron by substitution of the original dimensions by chunking dimensions. 
The code generation is then a well known Z-polyhedron scanning problem. At 
present, the best solution is the Quillere et al. one M- Their method is well 
adapted to the chunking problem provided we generalize it somewhat. We have
implemented an extended version, CLooG, which can handle sequential inner
loops and imperfect loop nests. Our resulting code is quite efficient.

5 Experimental Results

We are implementing our approach in the Chunky¹ source-to-source optimizing
tool. This prototype implements at present the process from the chunking
function calculation to the code generation, but without group and spatial
locality improvement support. This prototype already allows us to present
preliminary results for some important non-trivial problems. The experiments
were conducted on a PC workstation with a Pentium III processor running at
1 GHz. This processor comes with two cache levels: a split first level (L1) for
instructions and data of 16 KB each and a unified second level (L2) of 256 KB.
Figure 4 shows the evolution of the number of cache misses observed with
hardware counters for the original and target versions of the running example
(see figure 1), according to the value of the parameter m.




m : array dimension (words, log scale) 

Fig. 4. Cache misses for the running example 



The ratio m/n is set to 64 in order to better show the impact of our method.
The number of cache misses sharply grows when the array b becomes larger
than a cache level in the original program. The chunked program has a better
behavior. The miss growth comes later, when the input hypotheses are no longer
satisfied, i.e. when the array a cannot fit in the cache. We have observed the
same phenomenon on most of the programs with good data reuse we have tested.
Some experimental results on well known problems are shown in figure 5. The
compiler option was O3 for the original programs, but O1 for the transformed
programs in order to prevent any compiler optimization that can disturb the
chunking. As for the running example, chunking can reduce the number of cache
misses by more than one order of magnitude. This cache miss reduction can imply
a significant performance improvement. The speedup is better with big problems.
Since the miss penalty for an L2 miss is of the order of 10 times an L1 miss,
these results are not surprising. The situation of Gauss-Jordan for 80 * 80
arrays shows how necessary it is to avoid control overheads. In this (rare)
case, despite the attention given to code generation and a significant cache
miss reduction, our method fails to improve performance on small problems. The
point of view is quite different when the critical resource is energy, like in
embedded systems. Catthoor et al. [3] show that data movements in the hierarchy
are one of the main causes of energy consumption. In this case, a cache miss
reduction is always a benefit.

    problem                  array size (words)   missdown (%)   speedup (%)
    running example          16K                  99.1 (L1)         7
                             1M                   99.9 (L2)       427
    LU decomposition         80 * 80              79.3 (L1)         2
                             256 * 256            84.1 (L2)        43
    Cholesky factorization   80 * 80              70.3 (L1)         2
                             256 * 256            85.5 (L2)        46
    Gauss-Jordan             80 * 80              70.2 (L1)       -13
                             256 * 256            93.1 (L2)        26

Fig. 5. Experimental results

¹ Parts of Chunky are freely available at http://www.prism.uvsq.fr/~cedb



6 Related Work 

The effort of research to create effective locality optimizing compilers began with 
Wolf and Lam m and their data locality optimizing algorithm. This algorithm 
applies unimodular transformations to loop nests in order to maximize locality, 
according to an evaluation of the relevance of legal loop transformations. Then
it applies tiling [16] to the innermost loops. In comparison, our approach is
applicable to a wider range of programs since, on the one hand, we do not
require perfect nests or nests that can be made perfect and, on the other hand,
we do not require dependences to have any simplified shape (the Wolf and Lam
algorithm needs the dependence vectors to be lexicographically positive).
Moreover, making loops perfect and tiling them imply severe control overhead,
while we minimize it thanks to an accurate code generation method.

Li HD generalizes the framework of unimodular matrices [2] by using linear, 
non-unimodular transformations to change the iteration space. We expect our
algorithm will find more accurate transformations in practice since Li's
transformation and dependence types are quite simple: the transformations do
not handle parameters and the only case discussed is the one where dependences
are represented by distance vectors.

McKinley et al. m propose a technique based on a detailed cost model 
that drives the use of loop permutation, fusion and distribution. They apply
the basic transformations according to a definite order, while this strategy can
be ineffective for some problems. Finding the best application order of the
transformations for a given program is known to be very hard. Chunking
bypasses this difficulty because it unifies all kinds of linear transformations
in a single framework. For group-reuse, McKinley et al. consider the classic
case of uniformly generated references [7], with small restrictions. We propose
to go beyond this case by optimizing group-locality between non uniformly
generated references when they are in different statements. In return, chunking
processing is heavier than the McKinley et al. algorithm.

Alternatively to these control centric techniques, Kodukula et al. [9] propose
a data centric approach that plans to act on data movement directly, rather
than as a side-effect of control flow manipulations. Our work shares many
features with [9]. Both papers are set in the framework of the polytope model, and
aim at partitioning the code in pieces which are (almost) free of cache misses. 
Both techniques transform the code by well known transformations (loop ex- 
change, loop skewing . . .): the problem is not to invent new transformations, 
but to find the right transformation for a given program. There are however 
several important differences. Kodukula et al. start from the following intuition: 
once a datum has been brought into the cache, it is beneficial to execute all op- 
erations which access this datum. Our approach is different since we start from 
an estimate of the traffic and try to minimize it. In both cases we have to find a 
transformation legal for dependences. But while Kodukula et al. can just check if 
their transformation respects dependences, we have integrated the legality in the 
transformation construction. Lastly, while Kodukula et al. use an arbitrary array 
blocking, we show that significant improvements can be obtained without block- 
ing. Testing whether blocking can improve our results is left for future studies. 

7 Conclusion 

In this article, we have presented a method based on traffic evaluations for data 
locality improvement. It exhibits many advantages. First of all, the computed 
solution always fulfills the memory requirements imposed. Next, it can be ap- 
plied to any static control slice of a program. Lastly, there is no requirement on 
dependences and we compute the space of all legal transformations directly. The 
method requires nothing besides the original code but the relative sizes of the 
cache and data. 

First results are very encouraging and make us believe that our technique is
a new significant way to achieve data locality automatically for a large number
of problems. Moreover, chunking seems to be well adapted to several extensions
and we plan to obtain even better theoretical and practical results. We are 
currently working on tiling which seems to be the natural continuation of our 
approach. Intuitively, tiling is a question of aggregating small chunks or splitting 
big ones. We are also working on a more accurate solution for spatial locality 
improvement. A step in that direction is the work of Loechner, Meister and 
Clauss m, which is based on precise counting of memory accesses. Lastly, we 
must deal with programs which have static control regions but do not have static
control in toto. Locality optimization has the nice property that there is no
need of applying it to far away statements, since the hope of having reuse in
this situation is very small. Hence chunking can be applied locally, i.e. to
loop nests or small subroutines, and there is no danger of an excessive
compilation time. Our method can be adapted to local memories (or software
managed caches) at the price of more attention to footprint layout.